Preface



These proceedings collect the papers selected for the 2nd International Conference
on Interoperating Geographic Information Systems held in Zürich, Switzerland,
10-12 March, 1999.

Interoperability has become an issue in many areas of information technology 
in the last decade. Computers are used everywhere, and there is an increasing 
need to share various types of resources such as data and services. This is espe- 
cially true in the context of spatial information. Spatial data have been collected, 
digitized and stored in many different and differing repositories. Computer soft- 
ware has been developed to manage, analyse and visualize spatial information. 
Producing such data and software has become an important business opportu- 
nity. In everyday spatial information handling in many organisations and offices, 
however, interoperability is far from being a matter of fact. Incompatibilities 
in data formats, software products, spatial conceptions, quality standards, and 
models of the world continue to create asynchronicity among constituent parts
of operating spatial systems. As a follow-up to the first International Conference
on Interoperating Geographic Information Systems, held in 1997 in Santa Barbara,
California, Interop’99 tries to provide a scientific platform for researchers in
this area.

The international program committee carefully selected 22 papers for presen- 
tation at the conference and publication in this volume. Additionally, this vol- 
ume contains three invited contributions by Gio Wiederhold, Adrian Cuthbert 
and Gunther Landgraf. Every paper was sent to three members of the program 
committee and other experts for review. The reviews resulted in a three-day 
single-track conference program that left some room for a few half-day tutorials 
on various topics regarding GIS interoperability.

Many people supported this event in various ways. We would like to express 
our thanks to the members of the program committee and the additional review- 
ers for their support in selecting the papers presented in this volume. Caroline 
Westort provided a lot of help handling all communication issues of Interop’99, 
as did Doris Wild with the organization of the conference. Hansrudi Noser
supported many authors by translating their contributions into proper LaTeX code.

We would also like to thank our sponsoring institutions for the various ways 
in which they provided support. 
1 Introduction

The objective of interoperation is to increase the value of information when 
information from multiple sources is accessed, related, and combined. However, 
care is required to realize this benefit. One problem to be addressed in this 
context is that a simple integration over the ever-expanding number of resources 
available on-line leads to what customers perceive as information overload. In 
actuality, the customers experience data overload, making it nearly impossible 
for them to extract relevant points of information out of a huge haystack of data. 

Information should support the making of decisions and actions. We distin- 
guish Interoperation of Information from integration of data and databases, since 
we do not expect to combine the sources, but only selected results derived from 
them [27]. If much of the data obtained from the sources is materialized, then 
the integration of information overlaps with the topic of data warehousing [28]. 
In the interoperation paradigm we prefer that merging be performed as the
need arises, relying on articulation points that have been found and defined
earlier [13]. If the base sources are transient, a warehouse can provide a suitable
persistent resource.

Interoperation requires knowledge and intelligence, but increases substanti- 
ally the value to the consumer. For instance, domain knowledge which combines 
merchant ship data with trucking and railroad information permits a customer 
to analyze and plan multi-modal shipping. Interoperating over multiple, distinct
information domains, such as shipping, cost-of-money, and weather, requires broader
knowledge, but will further improve the value of the information. Consider here 
the manager who deals with delivery of goods, who must combine information 
about shipping, the cost of inventory that is delayed, and the effects of weather 
on possible delays. This knowledge is tied to the customer’s task model, which 
provides an intersecting context over several source domains. 

The required value-added service tasks, such as selection of relevant and
high-quality data, matching of source data, creating fused data objects, summarizing
and abstracting, fall outside of the capabilities of the sources, and are costly to
implement in individual applications. The provision of such services requires an 
architecture for computing systems that recognizes their intermediate functiona- 
lity. In such an architecture mediating services create an opportunity for novel 
on-line business ventures, which will replace the traditional services provided by
consultants, analysts, and publishers. 



2 Architecture 

We define the architecture of a software system to be the partitioning of a system 
into major pieces or modules. Modules will have independent software operation, 
and are likely located on distinct, but networked hardware as well. Criteria for 
partitioning are technical and social. The prime technical criterion is having a
modest bandwidth requirement across the interfaces among the modules. The
prime social criterion is having a well-defined domain for management, with
local authority and responsibilities. Luckily, these two criteria often match. It is 
now obvious that building a single, integrated system for any substantial enter- 
prise, encompassing all possible source domains and knowledge about them is 
an impossible task. Even abstract modeling of a single enterprise in sufficient 
detail has been frustrating. When such proposals were made in the past, the 
scope of information processing in an enterprise was poorly understood, and 
data-processing often focused on financial management. Modern enterprises use 
a mix of public market and service information in concert with their own data. 
Many have also delegated data-processing, together with profit-and-loss respon- 
sibilities, to smaller units within their organizations. An integrated system wa- 
rehousing all the diverse sources would not be maintainable. Each single source, 
even if quite stable, will still change its structure every few years, as capabilities 
and environments change, companies merge, and new rate-structures develop. 
Integrating hundreds of such sources is futile. 

Today, a popular architecture is represented by client-server systems
(Figure 1). Simple middleware, such as CORBA and COM [24], provides communication
between the two layers. However, these 2-layer systems do not scale well as the
number of available services grows. While assembly of a new client is easy if all 
the required services exist, if any change is needed in an existing service to ac- 
commodate the new client, a major maintenance problem arises. First of all, all 
other clients have to be inspected to see if they use any of the services being up- 
dated, and those that do have to be updated when the service changes, in perfect 
synchrony. Scheduling the change-over to a date that is suitable for the affected
clients induces delays. Those delays in turn cause other update needs to arise,
which will have to be inserted on that same day. The changeover becomes a major
event, costly and risky. 

Hence, dealing with many, say hundreds of data servers entails constant chan- 
ges. A client-server architecture of that size will likely never be able to serve the 
customers. To make such large systems work, an architectural alternative is re- 
quired. We will see that changes can be gradually accommodated in a mediated 
architecture, as a result of an improved assignment of functions. 
Fig. 1. A Client-Server Architecture



2.1 Mediator Architecture 

The mediator architecture envisages a partitioning of resources and services in 
two dimensions, as shown in Figure 2 [42]: 

1. horizontally into three layers: the client applications, the intermediate service
modules, and the base servers.

2. vertically into many domains: for each domain, the number of supporting
servers is best limited to 7 ± 2 [33].

The modules in the various layers will contribute data and information to each 
other, but they will not be strictly matched (i.e., not be stovepiped). The verti- 
cal partitioning in the mediating layer is based on having expertise in a service 
domain, and within that layer modules may call on each other. For instance, 
logistics expertise, as knowledge about merchant shippers, will be kept in a 
single mediating module, and a superior mediating module dealing with shared 
concepts about transportation will integrate ship, trucking, and railroad infor- 
mation. At the client layer several distinct domains, such as weather and cost of 
shipping, will be brought together. These domains do not have commensurate 
metrics, so that a service layer cannot provide reliable interoperation (Figure 
3). The client layer and, in it, the logistics customer, has to weigh the combination
and make the final decision to balance costs and risks. Similarly, a farmer
may combine harvest and weather information. Moving the vagueness of combining
information from dissimilar domains to the client layer reduces the overall
complexity of the system.

Fig. 2. A Mediated Architecture



2.2 Task Assignment 

In a 2-layer client-server architecture all functions had to be assigned either 
to the server or to the client modules. The current debates on thin versus fat 
clients and servers illustrate that the alternatives are not clear, even though 
some function assignments are obvious. With a third, intermediate layer, which 
mediates between the users and the sources, many functions, and particularly 
those that add value, and require maintenance to retain value, can be assigned 
there. We will review those assignments now. 



Server. Selection of data is a function which is best performed at the server 
since one does not want to ship large amounts of unneeded data to the client 
or the mediator. The effectiveness of the SELECT statement of SQL is evidence 
of that assignment; not many languages can make do with one verb for most 
of their functionality. Making those data accessible may require a wrapper at or 
near the server, so that access can be performed using standard interfaces. 

Fig. 3. Formal and Pragmatic Interoperation (Application layer: informal,
pragmatic, user control; Mediation layer: formal service, domain-expert control)



Client. Interaction with the user is an obvious function for the clients. Local 
response must be rapid and reliable. Adaptation to the wide variety of local 
devices is best understood and maintained locally. For instance, moving from 
displays and keyboards to voice output and gesture input requires local feedback. 
Images and maps may have to be scaled to suit local displays. When maps are 
scaled, the labeling has to be adjusted [3]. 



Mediator. Functions such as the integration of data from multiple servers and the
transformation of those data into information that is effective for the client
program are suited neither to a server nor to a client. Requiring that any
server can interoperate with any other possible relevant server imposes require- 
ments that are hard to establish and impossible to maintain. The resulting 
complexity is obvious. Similarly, requiring that servers can prepare views for any 
client is also onerous; in practice the load of adaptation would fall on the client. 
To resolve this issue of assignment for interoperation we define an intermediate 
layer, and establish modules in that layer, which will be referred to as mediators. 
The next section will deal with such modules, and focus on some of the roles 
that arise in geographic-based processing. 
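The assignment of functions to the three layers can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names `Server` and `Mediator`, the function `client_report`, the record fields, and the sample data are all hypothetical and do not describe any particular mediator implementation. Selection stays at the server, integration and ordering happen in the mediator, and the client only presents the result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Server:
    """Base layer: owns the data and performs selection close to it."""
    name: str
    rows: List[Record]

    def select(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # Only matching rows leave the server, tagged with their origin.
        return [dict(row, source=self.name) for row in self.rows if predicate(row)]


@dataclass
class Mediator:
    """Intermediate layer: integrates and transforms data for one domain."""
    domain: str
    servers: List[Server]

    def options_to(self, destination: str) -> List[Record]:
        selected: List[Record] = []
        for server in self.servers:
            selected.extend(server.select(lambda row: row["to"] == destination))
        # Transformation step: order the integrated rows by cost.
        return sorted(selected, key=lambda row: row["cost"])


def client_report(mediator: Mediator, destination: str) -> str:
    """Client layer: presentation and the final, pragmatic choice."""
    options = mediator.options_to(destination)
    if not options:
        return f"no {mediator.domain} option to {destination}"
    best = options[0]
    return f"{best['mode']} via {best['source']} at {best['cost']}"


# Hypothetical sample data for two transport servers.
ships = Server("ships", [{"mode": "ship", "to": "Oslo", "cost": 120},
                         {"mode": "ship", "to": "Genoa", "cost": 90}])
trucks = Server("trucks", [{"mode": "truck", "to": "Oslo", "cost": 200}])
transport = Mediator("transport", [ships, trucks])
print(client_report(transport, "Oslo"))
```

A change in one server's record layout would, in this sketch, be absorbed inside `Mediator.options_to` without touching the client.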
3 Mediators



Interoperation with the diversity of available sources requires a variety of func- 
tions. The mediator architecture has to accommodate multiple types of modules, 
and allow them to be combined as required. For instance, facilitators will search 
for likely resources and ways to access them [44]. To serve interoperation, related 
information that is relevant for the domain has to be selected and acquired from 
multiple sources. Query processors will reformulate an initial query to enhance 
the chance of obtaining relevant data [5,12]. Text associated with images can be 
processed to yield additional keys [22]. Selection then obtains potentially useful 
data from the sources, and has to balance relevance with cost of moving the data 
to the mediator. After selection, further processing is needed for integration and 
making the results relevant to the client. In this exposition we concentrate on
issues that relate to spatial information, and on two topics in particular: integration
and transformation. The references given can be used to explore other areas.



3.1 Integration 

Selection from multiple sources will obtain data that is redundant, mismatched, 
and contains excessive detail. Web searches today demonstrate these weaknesses:
they focus on breadth of selection and leave the extraction of useful information
to the user.



Omitting redundancy. When information is obtained from a broad selection 
of sources, as on the web, redundancy is unavoidable. But since sources often 
represent data in their own formats, omitting overlaps has to be based on simi- 
larity assessment, rather than on exact matches [20]. When geographic regions 
overlap, the sources that are most relevant to the customer in terms of content 
and detail are best. Assessing the similarity of images requires new technologies;
wavelets appear to be promising [11].
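For textual records, omitting overlaps by similarity rather than by exact match might look as follows. The sketch assumes that a character-level ratio from Python's standard `difflib` module is an adequate similarity measure and that 0.85 is a sensible threshold; both are our illustrative choices, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Remove differences that are only formatting: case, commas, periods, spacing."""
    return " ".join(text.lower().replace(",", " ").replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two records after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def omit_redundant(records: List[str], threshold: float = 0.85) -> List[str]:
    """Keep a record only if it is not too similar to one already kept."""
    kept: List[str] = []
    for record in records:
        if all(similarity(record, earlier) < threshold for earlier in kept):
            kept.append(record)
    return kept


# Hypothetical records as they might arrive from three sources.
records = ["Port of Rotterdam, Netherlands",
           "port of rotterdam netherlands.",
           "Port of Hamburg, Germany"]
print(omit_redundant(records))
```

The first record of each similar group is kept here; a mediator would more likely keep the one that is most relevant to the customer.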

Quality of data is a complementary issue. A mediator may have rules such as
‘Source A is preferable over Source B’, or ‘more recent data are better’, but
sometimes differences of data values obtained cannot be resolved at the mediating
level, because the metrics for judgement are absent. If the differences are
significant, both results, identified with their sources, can be reported to the client
[1].
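The two kinds of rules, and the fallback of reporting every value with its source, can be expressed as a small resolution function. The `Reading` type, the ranking table, and the tolerance parameter are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Reading:
    source: str
    value: float
    as_of: date


def resolve(readings: List[Reading], source_rank: Dict[str, int],
            tolerance: float) -> List[Reading]:
    """Return one reading when a rule settles the matter, otherwise all of them."""
    if not readings:
        return []
    if all(r.source in source_rank for r in readings):
        # Rule "Source A is preferable over Source B": the lowest rank wins,
        # and equally ranked sources are separated by recency.
        best = min(readings,
                   key=lambda r: (source_rank[r.source], -r.as_of.toordinal()))
        return [best]
    values = [r.value for r in readings]
    if max(values) - min(values) <= tolerance:
        # Rule "more recent data are better", applied when the values roughly agree.
        return [max(readings, key=lambda r: r.as_of)]
    # No rule applies and the difference is significant: report every value
    # together with its source and leave the judgement to the client.
    return sorted(readings, key=lambda r: r.source)


# Hypothetical readings of the same quantity from three sources.
a = Reading("A", 10.0, date(1998, 1, 1))
b = Reading("B", 14.0, date(1999, 1, 1))
c = Reading("C", 10.2, date(1999, 2, 1))
print(resolve([b, a], {}, tolerance=0.5))
```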



Matching. Integration of information requires matching of articulation points, 
the identifiers that are used to link entities from distinct sources. Matching of 
data from sources is based mainly on terms and measures. We now have to link 
complementary information, say text and maps. When sources use differing ter- 
minologies we need ontological tools to find matching points for their articulation 
[13]. 
While articulation of textual information is based on matching of abstract
terms, when systems need to exchange actual goods and services, physical pro- 
ximity is paramount. This means that for problems in logistics, in military plan- 
ning, in service delivery, and in responding to natural disasters geographic mar- 
kers are of prime importance. 

Georeferencing. Unfortunately, the representation of geographic fiducial points 
varies greatly among sources and their representations. We commonly use na- 
mes to denote geographic entities, but the naming differs among contexts. Even 
names of major entities, such as countries, differ among respected resources. While
the U.N. web pages refer to “The Gambia”, most other sources call the country
simply “Gambia”. If we include temporal variations then the names of the components
of the former USSR and Yugoslavia induce more complexity. Based on
current sources we would not be able to find in which country the 1984 Winter 
Olympics were held [25]. When native representations use differing alphabets 
another level of complexity ensues. 

The problems get worse at finer granularity. Names and extents of towns and 
roads change over time, making global information unreliable. For delivery of 
goods to a specific loading dock at a warehouse local knowledge becomes essen- 
tial. Such local knowledge must be delegated to the lowest level in the system 
to allow responsive maintenance and flexibility. In modern delivery systems, such as
those used by the Federal Express delivery service, the driver makes the final
judgement and records the location as well as the recipient. 

Using latitude and longitude can provide a common underpinning. The wide 
availability of GPS has popularized this representation. While commercial GPS
is limited to about 100 m precision, the increasing capabilities of ground-based
emitters (pseudolites), used in combination with space-based transmitters, can
conveniently increase the precision to a meter, allowing, for instance, the matching
of trucks to loading gates [17]. The translations required to move from
geographically named areas and points to areas described by vertices are now well
understood, but remain sufficiently complex that mediators are required
to relieve clients of performing such transformations.

Matching interacts with selection, so that the integration process is not a 
simple pipeline. The initial data selection must balance breadth of retrieval with 
cost of access and transmission. After matching, retrieval of further articulated
data can ensue. To access ancillary geographic sources, the names or spatial
parameters that serve as their keys must be used. When areas are to be located,
circumscribing boxes must be defined so that all possibly relevant material is
included, and the result is then filtered locally [19]. Again, many of these
techniques are well understood, but require the right architectural setting to
become available as services to a larger user population [16].
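The two-step retrieval by circumscribing box and local filter can be sketched as follows. We assume planar coordinates and a simple polygon, and use the familiar ray-casting test for the exact filter; real geographic data would need attention to projections and to areas that cross the date line. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y), e.g. longitude and latitude
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bounding_box(polygon: List[Point]) -> Box:
    """Smallest axis-aligned box that circumscribes the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point: Point, box: Box) -> bool:
    """Cheap test, suitable as the selection key sent to a source."""
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y


def in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Exact test by ray casting: count edge crossings to the right of the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(points: List[Point], polygon: List[Point]) -> List[Point]:
    box = bounding_box(polygon)
    candidates = [p for p in points if in_box(p, box)]        # broad selection
    return [p for p in candidates if in_polygon(p, polygon)]  # local filter


# Hypothetical triangular area and three candidate points.
area = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(locate([(1.0, 1.0), (3.0, 3.0), (5.0, 5.0)], area))
```

The point (3, 3) shows why the second step is needed: it lies inside the box but outside the area.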

3.2 Transformation 

Integration brings together information from autonomous sources, and that means
also that data is represented at differing levels of detail. For instance,
geographic results must be brought into the proper context for the application domain.
Often detailed data must be aggregated to a higher level of granularity. For in- 
stance, to assess sales in a region, detailed data from all stores in the region must 
be aggregated. The aggregation may require multiple hierarchical levels, where 
postal codes and town names provide intermediate levels. Such a hierarchy can 
be modeled in the mediator, so that the client is relieved from that computation. 
The summarization will also reduce the volume of data, relieving the network and
the processors from high demands.
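A hierarchy of this kind, held in the mediator, reduces the aggregation to a repeated roll-up. The mapping tables, store identifiers, and sales figures below are invented for illustration; only the chain of levels (store, postal code, town, region) follows the text.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical hierarchy kept in the mediator: store -> postal code -> town -> region.
STORE_TO_POSTAL = {"s1": "8001", "s2": "8001", "s3": "8400", "s4": "3000"}
POSTAL_TO_TOWN = {"8001": "Zurich", "8400": "Winterthur", "3000": "Bern"}
TOWN_TO_REGION = {"Zurich": "Canton Zurich", "Winterthur": "Canton Zurich",
                  "Bern": "Canton Bern"}


def roll_up(totals: Dict[str, float], parent_of: Dict[str, str]) -> Dict[str, float]:
    """Aggregate totals one level up the hierarchy."""
    result: Dict[str, float] = defaultdict(float)
    for child, amount in totals.items():
        result[parent_of[child]] += amount
    return dict(result)


def sales_by_level(store_sales: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Summaries at every intermediate level, so the client can drill down."""
    by_postal = roll_up(store_sales, STORE_TO_POSTAL)
    by_town = roll_up(by_postal, POSTAL_TO_TOWN)
    by_region = roll_up(by_town, TOWN_TO_REGION)
    return {"postal": by_postal, "town": by_town, "region": by_region}


# Hypothetical detailed data as delivered by the sources.
store_sales = {"s1": 100.0, "s2": 50.0, "s3": 25.0, "s4": 10.0}
print(sales_by_level(store_sales)["region"])
```

Only the regional totals need to cross the network to a client that asks for regional sales.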



Summarization. The actual computation of quantitative summaries can again 
be allocated to the source, to the mediating layer, or to the client. Languages 
used for server access, such as SQL, provide some means for grouping and sum- 
marization, although expressing the criteria correctly is difficult for end-users. 
Warehouse and data-mining technology is addressing these issues today [2], but 
supporting a wide variety of aggregation models with materialized data is very 
costly. The mediator can use its model to drive the computation. However, server 
capabilities may be limited. Even when SQL is available, the lack of an operator
to compute the variance, complementing the AVERAGE operator, also motivates
moving aggregating computations out of the server. While in 90% of the cases
the average is a valid descriptor of a data set, not warning the end-user that the 
distribution is far from normal (bi-modal or having major outliers) is fraught 
with dangers in misinterpretation. Knowledge encoded in a mediator can provide 
warnings to the client, appropriate to the type of service being provided, that 
the data is not trustworthy. 
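A mediator-side summary that reports the variance alongside the average, and attaches a warning when the average is likely to mislead, could be sketched as below. The test used for "far from normal" is our assumption, not part of the architecture: it is the moment-based heuristic sometimes called the bimodality coefficient, (skewness² + 1) / kurtosis, which equals 5/9 for a uniform distribution and grows for bimodal or heavily skewed data. It is a crude indicator and ignores small-sample corrections.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    count: int
    average: float
    variance: float      # population variance
    warnings: List[str]


def summarize(values: Sequence[float]) -> Summary:
    n = len(values)
    if n == 0:
        raise ValueError("cannot summarize an empty data set")
    average = sum(values) / n
    deviations = [v - average for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    warnings: List[str] = []
    if m2 > 0:
        m3 = sum(d ** 3 for d in deviations) / n
        m4 = sum(d ** 4 for d in deviations) / n
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2
        # Heuristic threshold (assumption): above 5/9 the data may be
        # bimodal or dominated by outliers.
        if (skewness ** 2 + 1) / kurtosis > 5 / 9:
            warnings.append("distribution is far from normal; the average may mislead")
    return Summary(n, average, m2, warnings)


# Hypothetical data: two clusters whose average describes neither of them.
print(summarize([1, 1, 1, 1, 9, 9, 9, 9]))
```

The warning travels with the summary, so the client can decide whether to request the underlying detail.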

While numeric processing for summarization is well understood, dealing with 
other data types is harder. We now have experimental abstractors that will 
summarize text for customers [29]. Such summarizations may also be cascaded 
if the documents can be placed into a hierarchical customer structure [39].

Aggregation may also be required prior to integrating data from diverse sour- 
ces. Autonomous sources will often differ in detail, because of differing infor- 
mation requirements of their own clientele. For instance, cost data from local 
schools must be aggregated to county level before it can be compared with other 
county budget items. The knowledge to perform aggregation to enable matching 
is best maintained by a specialist within the school administration; at the county 
budgeting level it is easy to miss changes in the school system. Other transfor- 
mations that are commonly needed before integration can be performed are to 
resolve temporal inconsistencies [18], or context differences, as seen in data about 
countries collected from different international agencies [25]. 

The complexity of these transformations is such that they are not appropriate 
for assignment to the client. Transformations performed on results of integration 
can, of course, not be assigned to servers. 

Object-structuring. Anyone using the web today can attest to the comple- 
xity that linearly presented data imposes on the customer who seeks relevant 
information. Most clients are best served by structuring their information in 
object-oriented form. That means not only carrying forward the top-level
summarization, but also the details that contribute to the summaries. Structural
modeling tools can transform relational source data into diverse object-oriented 
formats, as needed by the client [6]. The base model can cover multiple sources. 

Differing contexts require alternate hierarchies. In geography we distinguish 
political, social, topographical, and other hierarchies. While geographically-based 
hierarchies are common, other aggregations may be based on social criteria, as 
income or age of customers. Layering of geographic criteria and social criteria is 
also common. 



Digital Libraries. Related research is being performed within the Digital Library
Project, supported by NSF, DARPA, and NASA. For publications such as journals
and books, mediating selection services were traditionally provided through
reviewers and editors, while libraries, through their indexers and local storage
capabilities, provided dissemination services to the clients. The technical challenge in
automating the process is again dealing with the lack of common structure [23], 
heterogeneity of sources [35], and the redundancy [40] in the source data. For 
geographic libraries the base material is graphics and images, identified by rela- 
ted text [41]. There are many opportunities for innovative value-added services 
in this area [8]. 



3.3 Interfaces 

For building and maintaining multi-layer systems, interface standards are crucial.
When legacy files can be structured into tables, SQL will become the access
language, as is being done by many extensions of relational systems [9]. In addition
to accepted standards for data, such as SQL, ODL [10], and CORBA [36], a
number of new interfaces have appeared. For instance, a transmission protocol
for knowledge and data querying and manipulation being used in related re- 
search is KQML [30]. KQML provides for specification of the ontology being 
used in a transmission, to assure that the contents can be understood by com- 
municating modules. Currently XML is gathering much momentum [14]. When 
data cannot be structured well, the XML format provides an alternative. Such 
semi-structured data have been the topic of much recent research [38]. XML 
structures can be defined for specific domains, using domain-specific document
type definitions (DTDs). Those DTDs will be developed by specialists, and will help in
matching the meaning of the information being shared.

The alternative server-based technology, provided by pure Java, does allow 
uploading of functions to the client, but maintaining support for all user ap- 
plications in the server or mediator is costly, as is shipping of all presentation 
alternatives for all client types. Furthermore, since we envision that pragmatic 
integration and processing will occur in the client, we must transmit information 
in a form suitable for further processing, and not just for display. As a language, 
however, Java is attractive, and we are likely to see Java programs in the client 
interoperating with Java-beans in the mediators. 
Many new conventions are being considered for standardization, which will
provide stability, and solidify market share. However, it is wise to wait before 
imposing any such standards on the community until adequate practical expe- 
rience exists. It remains an open question how beneficial researcher involvement 
in the standards development process will be, but researchers will certainly be 
affected by the outcomes [31]. 



4 Status 



Capabilities for data collection are increasing rapidly, advances in communications
accelerate the flow, and the situations that the clients must deal with are
increasingly varied. Military intelligence systems were among the first users of
this technology, even before solid research results were obtained and documen- 
ted. Fusion of sensor data and images was already common. Geographic systems 
were integrated in several of these systems, but the interfaces to other data 
sources are still not very smooth. 

Most operational mediating systems have been explicitly programmed. This 
means that the knowledge the mediators embody is in the form of computer 
codes. Moving to more formal descriptions is the objective of much current de- 
velopment. Building new systems can become more effective if there is reuse of 
technology and knowledge [34]. Use of rules makes the mediator easier to manage,
which is important when the number of potential sources is large [37]. The leverage
offered by modest, domain-specific knowledge bases should be substantial, but 
still has to be proven. In geography, such concepts have been proposed, but their 
use for interoperation has not yet been shown [7]. 

As software suppliers gain experience there will be spinoffs into pure commer- 
cial work [15]. An early example is the use of matchmaking mediators leading now 
to application in the Lockheed-sponsored venture for distribution of space satellite
images [32]. A list of software suppliers was prepared for [45] and is maintained
in related web pages (http://www-db.stanford.edu/LIC/mediator.html).



4.1 Effectiveness 

Commercial dissemination of mediating modules will only occur if the informa- 
tion service paradigm proves to be effective. Interposition of a mediating layer 
into the client-server model incurs costs. A system’s performance cost may be
offset through reduction in transmitted data volume, as the information density 
increases. 

Crucial benefit/cost ratios are in balancing service quality and system
maintenance [43]. The bane of artificial intelligence technology has been the
cost-versus-benefit of knowledge maintenance. Mediation provides a focus for such
maintenance, in divorcing it from the operational pressures at the servers and 
the short-range needs at the clients, as shown in Figure 4. Reduced long-term 
maintenance costs may become the most powerful driver towards the use of mediating
technologies, since software maintenance absorbs anywhere from 60% to
90% of computing budgets, and increases disproportionately with scale.
Fig. 4. Mediation assigns responsibility for maintenance



4.2 Privacy 

Interoperation, while adding value, also adds risks. Combining information from 
multiple sources, aided by helpful agents that retrieve relevant information which 
was not directly requested, increases the risk of violation of individual and com- 
mercial privacy. Issues of privacy protection [26] and security must be addressed 
if broad access to valuable data is to become commonplace. A project on security
mediation focuses on this issue [46]. A security mediator is a distinct module
in an enterprise firewall, which complements traditional access protection with 
mechanisms to filter results before releasing them to the outside world. In a se- 
curity mediator the owner is the security officer in charge of an organizationally 
defined domain [21]. 
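The filtering of results before release can be illustrated with a minimal sketch. It assumes that the security officer's policy can be reduced to a table of fields releasable per requester role; the class name, roles, and fields are hypothetical, and the security mediator of [46] is not claimed to work this way.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]


class SecurityMediator:
    """Filters query results before they leave the protected domain."""

    def __init__(self, releasable: Dict[str, Set[str]]) -> None:
        # Policy owned by the security officer (assumed form):
        # requester role -> fields that may be released to that role.
        self._releasable = releasable

    def release(self, role: str, rows: Iterable[Row]) -> List[Row]:
        allowed = self._releasable.get(role, set())
        filtered: List[Row] = []
        for row in rows:
            visible = {field: value for field, value in row.items()
                       if field in allowed}
            if visible:  # rows with nothing releasable are withheld entirely
                filtered.append(visible)
        return filtered


# Hypothetical policy and result set.
mediator = SecurityMediator({"partner": {"shipment", "eta"}})
results = [{"shipment": "X1", "eta": "1999-03-10", "customer": "Acme"}]
print(mediator.release("partner", results))
```

Unknown roles receive nothing, so the default is to withhold rather than to release.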
5 Research

Having a need is not in itself an adequate motivation for research investment; there
also has to be a reasonable hope of moving towards solutions. In many areas, say 
in dealing with problems of strife and hunger, we are frustrated by complexity 
and a lack of leverage points. Providing information to agencies to effectively 
marshal and deploy their resources is a motivation for our research. Finding the 
right balance of the possible and the ideal is the major strategic issue in defining 
fundamental research. A tactical issue is finding the right time-point. 

Research to solve problems that industry recognizes tends to be futile for aca- 
demics. Industry will be able to devote sufficient resources to provide adequate, 
focused solutions. If academics can determine what solutions industry will ad- 
opt, then there are opportunities to go beyond. Going beyond can involve depth 
or breadth. Going in depth may mean dealing with likely omissions. In integra- 
tion that might be providing for translation of terms that do not match, but 
not providing the triggers when domains change so that translations have to be 
updated. Going in breadth in the same problem domain may mean devising rules
that can work for multiple domains, rather than for some specific translation. 

These tasks in information generation are complex and have to be adaptable 
to evolving needs of the customers, to changes in data resources, and to upgrades 
of system environments. The number of research issues needing solutions in the 
field is great. 



5.1 Semantics 

As the technical and syntactical problems of interoperation are being dealt with 
in industry, the semantic issues come to the forefront. Data resources, and es- 
pecially databases, carry implicit or explicit domain definitions — no database 
customer expects a merchant shipping database to deal with interest rates. Simi- 
larly, a financial database is expected to ignore details about ships and a weather 
database is innocent of both. In all three domains the knowledge needed to ade- 
quately describe the data is manageable, but great leverage is provided by the 
many ground instances that knowledge-based rules can refer to. 



5.2 Alternate Sources 

While integration started out in dealing with well-structured databases, much 
current focus is on semi-structured data, and the textual contents of those data.
Images, maps and graphs are brought in mostly through associated keys. Content
analysis of these sources is making progress, and will become input to data
integration. Video and speech are being analyzed as well, but their volume makes
integrated delivery to clients more problematic.

For planning and decision-making, results from simulations also need to be
integrated [4]. That will allow the clients not only to view the past, but also 
extrapolate timelines into the future [47]. Today this function is left wholly to 
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the clients, and the tools they have, as spreadsheets, are not well integrated into 
their processing systems. 

In the meantime, mediated systems are being built where alternatives are not 
feasible, for instance, where source data is hidden in legacy systems that cannot 
be converted, or where the planning cycle needed for data system integration is 
excessive. 
Fig. 5. Moving towards a New Science



6 Background 

Starting in 1992 the Advanced Research Projects Agency (ARPA, now DARPA), 
the agency for joint research over all service branches of the U.S. Department of 
Defense, initiated a new research program in Intelligent Integration of
Information (I3). Many results cited in this paper were initiated with ARPA support.
Later research presented here was supported by NSF CISE and AFOSR. We 
thank the many participants and students who have helped in developing and 
realizing the concepts presented here. 
7 Conclusion

Mediated systems are still in their infancy. We hope that ongoing development 
and deployment will fuel an effective research cycle. Having a clean architecture 
allows also a partitioning of research tasks, since the overall problem presented by 
information systems is greater than any single project can handle. Interoperation 
will require a variety of articulation points among sources and domain-specific 
knowledge. 

The architecture we presented allows multiple application hierarchies to be 
overlaid, so that the structure forms a directed acyclic graph from client to 
resource, although the information flow is in the opposite direction. The com- 
plexity is still an order less than that implied by arbitrary networks, simplifying 
composition both in terms of research and operational management. The final 
vision is summarized in Figure 5, indicating the inputs we have in order to move 
towards a new science, focusing on integration of capabilities and competencies 
needed to make large systems work. 
Abstract. The OpenGIS Consortium, OGC, has over one hundred and
fifty members drawn from both the user and vendor communities and
seeks to develop specifications for providing interoperability for geospatial
data access and geoprocessing. This paper unashamedly adopts a
biased perspective, that of a vendor active in the OGC. It attempts to 
explain the importance of the OGC in raising issues that go far beyond 
the writing of specifications. 

Unlike many of the papers included here, the subject of this paper does 
not readily submit to a rigorous academic analysis. Issues are argued by 
example rather than by proof. To many vendors, the worth of the OGC 
lies in its recognition of key commercial realities. Since the OGC does 
concern itself with implementation and because it is trying to use the 
best of emerging technologies but is not tied to one particular platform, 
it faces many of the same problems that a vendor encounters. It is fa- 
miliar with the compromise and pragmatism required to make progress. 
Consequently it provides one of the few forums where these issues are 
discussed. 

This paper gives time to underlying issues that, although raised in the 
commercial world, impinge directly on technical developments. Many of 
these issues remain current and deserve a wider audience. They represent 
tales of the OpenGIS process, both past and present, told by a vendor 
located in a small market town. 



1 Introduction 

1.1 What Makes It So Difficult?

The OpenGIS Consortium, OGC[1], was founded in the belief that new and
emerging technologies could fundamentally alter the way in which geospatial 
data and geoprocessing could be accessed. Technologies, such as component soft- 
ware, provide new ways of splitting up old problems. The OGC is not in existence 
to develop these new technologies, merely to harness the potential of those that 
emerge from the Information Technology mainstream. Component architectures 
include the promise of solutions built from best-of-breed components. However 
if such components are to work together they need to do more than share the
same technology base; they also need to agree on some core set of specifications.
Clearly the development of such specifications is far too important a task
to be left to individual vendors. Ideally an open process should control it. The 
OGC provides such a process. Thus the OGC embodies a desire to use the latest 
technology and a desire to develop pragmatic specifications through consensus. 
It also includes the natural dynamic that exists as a result of combining the two. 

The constitution of the OGC most closely resembles that of the Object Ma- 
nagement Group, OMG[2]. The OMG is a consortium of many hundreds of orga- 
nizations committed to providing a distributed computing environment. However 
that intent reveals a marked difference between the two. The OMG seeks to
develop a technology, whereas the OGC seeks to exploit technological developments. This
leaves the OGC more susceptible to changes in the underlying technology base. 

Indeed one of the major difficulties that the OGC has is that it cannot take 
refuge in a dogmatic adherence to its own developments. It does not yet possess 
an established base of users that can command industry to support it. This might 
be contrasted with the situation enjoyed by Java, where Sun Microsystems Inc. [3] 
can count on the world Java community [4] to make the development of new 
APIs highly desirable. The OGC must necessarily be responsive to the changes 
in the environment. It is in this regard that it faces many of the challenges 
faced by vendors, and therein lies its worth. As well as the problem of defining 
specifications it must constantly anticipate and analyze developments in the 
wider IT environment. 



1.2 Specifications Based on Interfaces 

The OGC provides an open process for defining specifications designed to help 
promote interoperability between geospatial data and geoprocessing. These spe- 
cifications allow the development of multi-tier solutions. The ‘open’ in OpenGIS 
refers to the fact that these tiers need not all be implemented by the same ven- 
dor/developer. When one is considering a single specification, it is common to 
distinguish between the two tiers involved as ‘client’ and ‘server’. This terminology
is used throughout this paper, but should not be taken to indicate that the
entire application conforms to a conventional client-server architecture. 

The OGC was founded in August 1994, when the technological wave was 
component software. Technologies such as the OMG’s Common Object Request 
Broker Architecture (CORBA)[5] and Microsoft’s Component Object Model 
(COM) [6] promised platform and language neutral mechanisms for plugging 
components together. These represent an object-oriented generalization of an 
approach pioneered by SQL. The OGC refers to these as Distributed Compu- 
ting Platforms, DCPs. These are the technologies against which specifications 
can be written. 

Both COM and CORBA are broker based solutions that allow the definition 
of ‘interfaces’ using a language neutral Interface Definition Language. From this 
both stub and skeleton (or client and server) code can be generated for a va- 
riety of programming languages. The broker transparently allows different tiers 
of a solution to communicate, whether it be between different languages and/or 
different processes/machines. The OGC refers to these as ‘interface’ based solutions.
With a good mapping between interface and programming language, this
approach provides a near seamless joining of tiers. For example exceptions can 
be propagated from one tier to another. In addition these solutions provide a 
natural mechanism to handle state. A reference from one tier to an object in 
another tier provides the mechanism for one tier to retain state in the other. 

However there are alternative approaches on which one might base a DCP. 
One might consider ‘message’ based solutions. In this approach communication 
between tiers is entirely via messages that conform to a structure that can be 
decoded. This means that the approach is not only language and platform neu- 
tral, it is also much less reliant on a common infrastructure shared between the 
tiers. About all that is required is an ability to send and receive messages. Ne- 
vertheless this approach does not provide the deep integration with languages 
provided by the interface approach, nor does it do a good job of retaining state 
between communicating tiers. Typically all the information required to define 
the request must be encoded in the out-going message. 
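
The contrast between the two styles can be sketched in a few lines of Python. This is an illustration only and is not taken from any OGC specification: the class `FeatureCursor`, the function `handle_message`, the toy data and the message fields `start` and `count` are all hypothetical names chosen for the example, and a local object stands in for what would be a remote one.

```python
import json

FEATURES = ["road", "river", "town", "station"]  # toy data held by the server tier


class FeatureCursor:
    """Interface style: the client holds a reference; the server object keeps state."""

    def __init__(self):
        self._position = 0  # state retained in the server tier between calls

    def next(self):
        if self._position >= len(FEATURES):
            # An exception raised here propagates to the calling tier.
            raise StopIteration("no more features")
        value = FEATURES[self._position]
        self._position += 1
        return value


def handle_message(raw):
    """Message style: each request carries everything needed to answer it."""
    request = json.loads(raw)
    start, count = request["start"], request["count"]
    return json.dumps({"features": FEATURES[start:start + count]})


# Interface style: two calls against the same stateful object.
cursor = FeatureCursor()
first, second = cursor.next(), cursor.next()

# Message style: one self-contained request, no state kept by the server.
reply = json.loads(handle_message(json.dumps({"start": 2, "count": 2})))
```

In the first style the position lives in the server object, so the client only needs its reference. In the second, the client must restate the position in every message.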

1.3 Do Interface Based Solutions Have a Future? 

In the current climate of technological change, there is an on-going responsibility 
to review fundamental assumptions. The OGC began with an assumption that 
there were a small number of DCPs that could be exploited. Since then two 
developments challenge that view: Java and the Internet. 

— Java is increasingly being presented as a platform in its own right. The 
problem is not that Java would represent a language specific DCP, rather 
that it increases the number of DCPs to be supported. Without ‘bridges’ 
between different DCPs, supporting multiple DCPs becomes increasingly 
untenable. 

— The characteristics of the Internet, most notably its latency, require changes
in emphasis. Most obviously there is an on-going need to minimize the
number of communications between client and server. Part of the solution is 
to be able to communicate more complex, structured data. 

There are features found in Java that are not provided by either COM or 
CORBA. The most obvious examples are its ability for introspection, the ability 
to discover and dynamically load new class definitions and, as a consequence, 
the ability of an object to serialize and distribute itself. Those who wish to take 
advantage of these features would prefer to see Java regarded as a platform in 
its own right. 

By contrast, advocates of a message-based approach are looking to reduce the 
dependence on platform specific capabilities. They say that the dual problems 
raised by Java and the Internet can be solved together; the Internet is providing
the tools to encode arbitrarily complex structured data (for example the
Extensible Markup Language, XML [7]) that can be used in messages between
tiers that may be running on different platforms.







A. Cuthbert 



While it is true that CORBA provides a mechanism to describe complex 
structures, the manipulation of these in a programming language rapidly betrays 
its CORBA origins. Nor does the inclusion of a CORBA based broker as part 
of the Java 2 specification irredeemably tie Java and CORBA together. There is 
still much work required to make the interaction between Java and CORBA as 
seamless as many would like. For example, although COM has always provided 
for classes supporting multiple interfaces, CORBA will only provide support for 
this in CORBA 3. 



2 Platforms 

2.1 How Much Functionality Should Be Included? 

The OGC is committed to providing interoperability not only for access to
geospatial data but also for the processing of geospatial data. Consider a simple
scenario: 

A server is accessible through an OpenGIS conformant interface. Data
from the server has been displayed on the screen and two features, each 
with an area geometry attribute, have been identified interactively. The 
application is designed to highlight the area common to both features, 
if any. We might reasonably expect a system that maintains and can 
retrieve geometric data, a system that is able to perform queries with 
a spatial constraint, to handle all our geometric requirements. In this 
example the requirement is to generate the area representing the inters- 
ection of two other areas. 

The question that must be answered by those that write specifications is 
‘where to stop?’. From the client’s perspective a large and all encompassing 
specification may be desirable, but it is of little worth unless it can be agreed on 
and delivered. This makes vendors a relatively conservative group. A minimum 
requirement might be stated as: 

It should not be necessary for a client to develop functionality that al- 
ready, albeit implicitly, exists within the server. 

For example if a server is capable of performing a spatial query that involves 
a test for whether geometries overlap, then it should be possible, given two geo- 
metries, to test explicitly whether they overlap. However, if the server cannot 
query on the area represented by the overlap, then one would not expect to be 
able to generate a new geometry representing the area of overlap. The purpose 
of this requirement is to avoid situations where a client must implement func- 
tionality implicitly available in the server. Not only does this reduce duplicated 
effort but also it minimizes the risk of incompatible implementations. 
2.2 What about Simple Features?

At the time of writing (December 1998) the only specification adopted by the 
OGC was that to provide access to ‘simple’ features[8]. The ‘simple’ in the title is 
something of a misnomer and merely indicates recognition that the specification 
does not cover access to all aspects of a feature. The specification is provided 
in different ‘flavors’, one for each of three DCPs. Formally the specifications are 
titled: 

The OpenGIS Simple Features Specification for 

- COM

- CORBA

- SQL

The decision to develop specifications for a number of identified DCPs was
taken at the time that proposals were requested. All the DCPs have the characteristics
of being both language neutral and of being able to support different
tiers on different machines. 

SQL has demonstrated such strengths for many years. Furthermore the emer- 
gence of the object-relational model required the SQL specification to have two 
variants; for ‘SQL92’ and for ‘SQL92 with Geometry Types’ [8]. In the object- 
relational model, the columns in a table are no longer restricted to simple types, 
but may include more complex ‘objects’. The inclusion of objects into relational 
databases allows them to offer unlimited functionality and ensures their position 
as a viable DCP.

The ‘SQL92 with Geometry Types’ variant acknowledges the emergence of 
vendor specific approaches to the object-relational model in advance of standards;
for example Informix with its DataBlade technology and Oracle with its
Cartridge technology. SQL-3 [9] seeks to support these innovations through its
introduction of Abstract Data Types, ADTs. Currently the OGC is involved in a
harmonization effort between future versions of ‘The OpenGIS Simple Features 
Specification for SQL’ and the ‘multimedia’ version of SQL-3, SQL-3/MM. 

All the OpenGIS Simple Features Specifications (with the exception of the 
‘SQL92’ only variant) provide the functionality required in our example scenario. 
The geometry interface has an ‘intersection()’ method (or the Geometry ADT
has a function) that takes ‘anotherGeometry’ as a parameter and ‘returns a
geometry that represents the point set intersection of the source geometry with
anotherGeometry’.
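
As an illustration of the scenario, the following Python sketch pairs an explicit overlap test with a method that generates the common area. It is a deliberately reduced stand-in, not the interface defined by the specification: the class `Rect` is hypothetical, it only handles axis-aligned rectangles, and it treats rectangles that merely touch along an edge as not overlapping, which is narrower than a true point set intersection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for an area geometry (a simplification)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        # True when the two rectangles share an area of non-zero size.
        return (self.xmin < other.xmax and other.xmin < self.xmax and
                self.ymin < other.ymax and other.ymin < self.ymax)

    def intersection(self, other: "Rect") -> Optional["Rect"]:
        # Returns the common area, or None when the rectangles do not overlap.
        if not self.intersects(other):
            return None
        return Rect(max(self.xmin, other.xmin), max(self.ymin, other.ymin),
                    min(self.xmax, other.xmax), min(self.ymax, other.ymax))


# Two selected features whose area attributes partly coincide.
a = Rect(0, 0, 4, 4)
b = Rect(2, 1, 6, 3)
common = a.intersection(b)  # the area the application would highlight
```

The test `intersects` is the predicate a spatial query would use. Exposing `intersection` alongside it means the client does not have to re-implement geometry that the server already handles.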

Despite all these similarities, there are profound differences between the three 
specifications. This is seen most clearly in considering how each of them deals
with the concepts of ‘feature’, the basic unit of geospatial data, and ‘geometry’, 
a complex type of value describing a location in space. Features may have a 
number of attributes, zero or more of which may be geometric. The requirement 
that features must have precisely one geometric attribute had been rejected 
as being too restrictive. For example it would not be possible for a feature 
representing a real world entity such as a road to have a ‘location’ modeling its 
extent on the ground and a ‘centerline’ modeling its idealized route. Nor would 
it be possible to model aspatial features in the same way as more conventional
features. 

The three DCPs for which OpenGIS Simple Feature Specifications exist pro- 
vide different approaches: 

— SQL: the specification is based on SQL92 with provision for databases that 
support ADTs. Features are represented as rows in tables or views. Geometry 
is handled either as explicit coordinates in additional ‘geometry’ tables, as 
Binary Large OBjects (BLOBs) or as an ADT. 

— COM: the specification is based around OLE/DB. This is Microsoft’s strategic
platform for all data storage, query and access. This provides COM
interfaces to common relational elements. For example the result of a query
can be returned as a ‘Rowset’ COM object. Again rows are interpreted as features.
However COM objects are used to represent geometries directly.

— CORBA: the specification provides an object-oriented model for geospatial
features built from the ground up. Both features and geometries appear as
objects accessible through CORBA interfaces.

OLE/DB represents the culmination of the process of integrating SQL and 
Microsoft technology by way of ODBC, DAO and the like. Nevertheless OLE/DB
remains tied to a tabular model of data. It is closer in approach to the SQL
specification than it is to the CORBA one. Rather than present a high-level
object model of geospatial data, it represents a low-level model for data access. 

Indeed no clear concept of a feature exists in the COM and SQL specifications.
Rows in tables and views are interpreted to represent features, but they
are not represented directly by objects or values in the underlying system. Thus 
features flit in and out of existence as views are created and deleted. Given that 
there are no geometric constraints on what constitutes a feature, in theory at 
least, a view with a single column of integers could be argued to represent a set 
of features. 
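
The interpretation of rows as features can be made concrete with a small sketch using Python's built-in `sqlite3` module. The table, view and column names are hypothetical, and the geometry is held as plain text purely for readability; the specification itself describes geometry tables, BLOBs or ADTs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE roads (id INTEGER PRIMARY KEY, name TEXT, centerline TEXT);
    INSERT INTO roads VALUES (1, 'High Street', 'LINESTRING(0 0, 10 0)');
    INSERT INTO roads VALUES (2, 'Mill Lane', 'LINESTRING(0 5, 5 5)');
    -- A view with a single integer column: still a 'feature set' by interpretation.
    CREATE VIEW road_ids AS SELECT id FROM roads;
""")


def features(connection, relation):
    """Interpret each row of a table or view as one feature (a dict of attributes)."""
    cursor = connection.execute(f"SELECT * FROM {relation}")
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


road_features = features(conn, "roads")
id_features = features(conn, "road_ids")

# Dropping the view removes its 'features'; nothing in the database recorded them.
conn.execute("DROP VIEW road_ids")
```

Nothing in the database marks a row as a feature. The notion exists only in the function that reads the rows, which is why a view of bare integers qualifies as readily as the `roads` table.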

Of course the decision to use interpretation on top of a tabular model was 
not taken lightly. The ultimate aim of the OGC is to see geospatial data more
readily incorporated into everyday applications. The adoption of OLE/DB as
the basis of the COM specification provides immediate integration with a wide
range of Microsoft tools and products. 

Given these differences, how does the OGC justify that these specifications
represent different flavors of the same thing? A test of closeness between the 
various specifications might be expressed this way: 

For a user familiar with the native capabilities of two DCPs and pro- 
ficient at using an OpenGIS specification on one of those DCPs, how 
much more do they need to learn to be proficient at using the correspon- 
ding OpenGIS specification for the other DCP? 



Essentially this requires that, for those areas where the DCP provides no
prior approach, the specifications for each DCP should be the same. However
it is not inappropriate to use the native capabilities of a DCP directly. Thus
for OpenGIS Simple Feature Specifications, the ability to store, query and manipulate
geospatial data may differ from one DCP to another, because most of
them are based on database technologies which provide precisely these capa- 
bilities. However there is no precedent for handling geometries in these DCPs. 
Consequently all flavors of a specification should provide semantically equivalent 
interfaces to geometries. This basically means that if the COM interface provides 
an ‘intersection’ method, then so should the CORBA interface. 

3 The Internet 

3.1 Where Is the Internet? 

The OGC was founded upon the recognition that the IT mainstream genera- 
ted technologies that could fundamentally alter all branches of computing. The 
OGC was founded in August 1994 when ‘component technology’ was the major 
innovation. Since that time the world has been exposed to another paradigm- 
shifting technology: the Internet. From the perspective of end-users, this is far 
more profound a development than the ability of components to make the task
of building software easier. How then, in all this discussion of DCPs, has there 
only been passing reference to the Internet? 

Primarily because the Internet does not, in itself, constitute a DCP[10]. This 
is most clearly seen in the fact that other DCPs (for example COM and CORBA) 
include Internet technology within their remit. For example Microsoft provide 
Remote Data Services (RDS) to allow a visual COM component in a Web bro- 
wser to maintain the fiction of a connection with COM components hosted on a 
remote server machine. This approach allows OLE/DB services to be ‘remoted’ 
to a Web client. At a more general level, CORBA brokers can communicate between
themselves over the Internet using the Internet Inter-ORB Protocol, IIOP.
Those that regard Java as a platform in its own right, are able to argue that it 
has the advantage of being developed at the same time the importance of the 
Internet was becoming apparent. 

Consequently COM and CORBA specifications can, in principle, be applied
across the Internet without change. This has been demonstrated to be true with 
the OpenGIS Simple Features Specifications: 

— One demonstration used the COM specification and RDS to build a thick 
client running in a Web browser. Since the OpenGIS Simple Features Speci- 
fication deals with data access rather than geoprocessing, it does not provide 
for rendering on the server. Consequently clients typically pull data across 
to the client and render it there. By Internet standards, the resulting client 
is ‘thick’ due to the quantity of data that it manages locally. 

— Another demonstration made use of the CORBA specification to manipulate 
and interrogate features in a geospatial datastore without the need to pull all 
the feature data over to the client. Although this more closely represents the 
thin client model common to many Internet solutions, it required additional 
elements to allow the rendering to be carried out at the server. 
These demonstrations show that defining specifications for DCPs that do not
specifically include the Internet does not preclude such solutions being deployed
over the Internet. However they also demonstrate that the Internet cannot be 
made completely transparent. The resulting solutions do not necessarily conform 
to Internet based expectations and/or require extensions to existing interfaces. 

The problem does not stem from the inability of the DCPs to utilize the 
Internet, but rather from the fact that OpenGIS Simple Features Specification 
is not complete enough to describe solutions architected for the Internet. Indeed 
this is only an exaggeration of what happens when the specification is applied 
on the desktop. Although one might expect desktop applications to naturally 
break into tiers based on data access and presentation, the use of the simple 
features specification mandates that rendering occurs in the presentation tier. 
Consequently all the data required for rendering needs to be passed through 
the interface. Obviously when both tiers are provided by the same vendor it is 
possible to add additional interfaces to allow the two to communicate, but this 
cannot be the case when one is attempting to make the two tiers interoperate. 

GIS over the Internet is an area where new solutions are being experimented
with, for which there is currently little consensus. This is in contrast to the 
area represented by the OpenGIS Simple Feature Specification. The problem of 
representing geospatial data in a database has been solved so many times before, 
there was little debate about the form of a solution. As the OGC tackles areas 
that are still under development, the task of reaching consensus becomes more 
difficult. GIS over the Internet is a prime example.

In response to this the OGC has, in conjunction with a number of sponsors, 
organized a Web mapping test bed that will allow interested parties to demon- 
strate the merits of their approach. The hope is that this will readily reveal 
basic similarities in approach that will themselves provide the basis of a com- 
mon framework for GIS over the Internet. This use of a test bed is only possible 
as a result of the way in which the OGC has provided an environment in which 
traditional rivals can work together. 



3.2 Where Are the Maps? 

When one thinks of geospatial data and the Internet, it is difficult to ignore 
one of the most compelling expectations; the ability to view geospatial data 
from anywhere in the world, of anywhere in the world based on data located 
anywhere in the world. The traditional way to view geospatial data has been as 
a map. A ‘map’ is something more than the data from which it is generated. 
It includes the choice of what data should be displayed and how it should be 
presented. Indeed one of the ways publishers of geospatial data add value is 
the manner in which they are able to present the data visually. Sometimes this 
means presenting the same data in different ways for different tasks. 

In some areas, development of Internet related standards has been slower
than many might have hoped [11,12]. When the Internet began its explosive growth,
it was already possible to construct pretty Web pages incorporating text
and images. Now we also have the ability to transfer applets, video and even models
of a virtual world using VRML. But we still do not have an agreed format
in which to exchange vector based images. 

Consequently virtually every GIS vendor has been obliged to develop their
own mechanism to communicate ‘maps’ across the Internet. This duplication of
effort appears even more absurd when one realizes that there are many domains,
other than GIS, that require an ability to communicate vector-based images.
As a reflection of this, the World Wide Web Consortium (W3C) has received a 
number of submissions in the area of 2D graphics[13], including: 

— Precision Graphics Markup Language: PGML, submitted by Adobe Systems 
Incorporated, International Business Machines Corporation, Netscape Com- 
munications Corporation and Sun Microsystems Inc., 03 April 1998. 

— Vector Markup Language: VML, submitted by Autodesk Inc., Hewlett- 
Packard Company, Macromedia Inc., Microsoft Corporation, Visio Corpo- 
ration, 13 May 1998. 

— WebCGM: submitted by the Boeing Company, Council for the Central Laboratory
of the Research Councils (CCLRC), Inso Corporation, Joint Information
Systems Committee (JISC), Xerox Corporation, 19 August 1998.

The standards community frequently tasks the OGC with reviewing proposed 
standards with regard to their applicability within the geospatial domain. Ne- 
vertheless a suitable transfer mechanism for communicating vector-based images 
of geospatial data stubbornly refuses to emerge. 

3.3 Interfaces and Messages 

If one wishes to reach a large audience using the Internet, it is necessary to work 
with a wide range of client configurations. It is not practical to make too many 
assumptions about the browser configuration of a casual user that makes an 
inquiry, the result of which includes a geospatial element. When the requirement 
is for a very thin client that makes the minimum of demands on resources on 
the client browser, a raster image remains the best way of communicating a map 
over the Internet. 

However there are a number of techniques for making life more pleasant for 
the more frequent user of geospatial data over the Internet. Downloadable con- 
trols and applets can provide improved ergonomics and client side interaction. A 
vector based description of geospatial data rendered at the server allows a client
to fuse input from multiple servers, provide instant feedback from embedded
tags and incrementally change the display during update. Rendered data need
not represent the full information content of a feature; for example a rendered
line may have been filtered based on the display scale to remove unnecessary
vertices. 
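
One simple way such filtering might be done is sketched below in Python. It is an assumption for illustration, not a method described by any vendor or specification: the function `filter_vertices` is hypothetical, and `tolerance` stands for whatever ground distance is judged indistinguishable at the current display scale, for instance the width of one pixel.

```python
import math


def filter_vertices(line, tolerance):
    """Drop vertices closer than `tolerance` to the last vertex kept.

    The first and last vertices are always retained so the line keeps its extent.
    """
    if len(line) <= 2:
        return list(line)
    kept = [line[0]]
    for point in line[1:-1]:
        if math.dist(kept[-1], point) >= tolerance:
            kept.append(point)
    kept.append(line[-1])
    return kept


# A line with two vertices that add no visible detail at a coarse scale.
line = [(0, 0), (0.2, 0.1), (1, 0), (1.1, 0.1), (2, 0), (3, 0)]
coarse = filter_vertices(line, tolerance=0.9)
```

The rendered line is lighter to transmit, but it is no longer a faithful copy of the stored geometry, which is the distinction the paragraph above draws.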

The experiences of many GIS vendors would suggest that the ability to communicate
a structured result representing a map is crucial. This practical use
of structured results would appear to represent the most likely outcome of the 
‘interface versus messages’ debate. 

Interfaces provide a high-level and familiar model to program against.
However methods defined in those interfaces may return sizeable, low-level
data structures (messages) as a result.

It might be noted that the unwritten assumption of the OpenGIS Simple 
Features Specification for CORBA was that CORBA objects are remote. This 
was in contrast with OLE/DB where typically the COM objects are local, alt- 
hough they may be representing data from a remote database. Consequently 
the CORBA specification includes elements which are designed to reduce the 
number of communications between client and server. The primary way of doing 
this is to take advantage of CORBA’s ability to return structures and sequences 
as method results. These are returned ‘by-value’, in contrast to the more usual 
approach of returning a reference to another remote object. 
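
The effect of returning results by value can be illustrated with a toy model in Python. The class `RemoteFeatureStore` and its methods are hypothetical and are not drawn from the CORBA specification; a counter stands in for network round trips.

```python
class RemoteFeatureStore:
    """Stand-in for a remote server; counts calls to mimic network round trips."""

    def __init__(self, names):
        self._names = list(names)
        self.round_trips = 0

    def count(self):
        self.round_trips += 1
        return len(self._names)

    def name_at(self, index):
        # Fine-grained access: one round trip per value fetched.
        self.round_trips += 1
        return self._names[index]

    def all_names(self):
        # Coarse-grained access: the whole sequence is returned by value.
        self.round_trips += 1
        return list(self._names)


chatty = RemoteFeatureStore(["A1", "A2", "A3", "A4"])
names_one_by_one = [chatty.name_at(i) for i in range(chatty.count())]

bulk = RemoteFeatureStore(["A1", "A2", "A3", "A4"])
names_in_bulk = bulk.all_names()
```

Both clients end up with the same four names, but the first needs five exchanges with the server and the second needs one. Over a high-latency link that difference dominates.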

Specifications are required in two areas; the interfaces themselves and the 
structures that are returned. Of course Internet based technologies for transmit- 
ting machine readable, structured data have already emerged, for example XML. 
One way to close the gap between specifications written for different DCPs is to 
require them to pass structured data using platform neutral standards. In such 
cases the requirement on a body like the OGC is to establish how they are used 
to encode standard geospatial concepts. 
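
A minimal sketch of what such an encoding might look like follows, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`Feature`, `Attribute`, `LineString`) are invented for the example and do not correspond to any agreed OGC encoding; settling such names is precisely the work the paragraph above assigns to a body like the OGC.

```python
import xml.etree.ElementTree as ET


def encode_feature(feature_id, attributes, coordinates):
    """Encode one feature as XML; element and attribute names are illustrative only."""
    feature = ET.Element("Feature", {"id": feature_id})
    for name, value in attributes.items():
        ET.SubElement(feature, "Attribute", {"name": name}).text = str(value)
    geometry = ET.SubElement(feature, "LineString")
    geometry.text = " ".join(f"{x},{y}" for x, y in coordinates)
    return ET.tostring(feature, encoding="unicode")


def decode_coordinates(document):
    """Recover the coordinate list on the receiving tier."""
    text = ET.fromstring(document).find("LineString").text
    return [tuple(float(v) for v in pair.split(",")) for pair in text.split()]


message = encode_feature("road-17", {"name": "High Street"}, [(0.0, 0.0), (10.0, 2.5)])
coordinates = decode_coordinates(message)
```

Because the message is plain text with a declared structure, the sending and receiving tiers need share nothing beyond the agreed vocabulary.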

4 Interoperability 

4.1 Interoperability at the Client 

Most attempts to define ‘interoperability’ require much more space than is avai- 
lable here. However it is worth noting that one can identify a number of different 
ways in which interoperability might be realized. Whereas we may aspire to one 
approach, the initial implementations of OGC specifications may only operate 
at another.

If there is a standard interface through which one client can interact with a 
number of different servers, then one has achieved a useful level of interopera- 
bility. For example one might have written a generic display tool that retrieves 
geospatial data from a number of servers and displays the result as a continuous, 
multi-scale map. One might be able to select a feature that originated on one 
server and use its location as input to a spatial query on data from another 
server. 

All this can be achieved with an approach that requires the client to explicitly 
manipulate multiple server connections. The individual servers are not aware of 
one another. Thus some tasks (like a spatial join between data from two different
servers) need to be either handled, or at least coordinated, by the client. For
example one might wish to find all towns that contain a railway station where 
data for towns and railway stations exist on different servers. 
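
The towns and stations example can be sketched as a join performed entirely by the client. Everything in the Python below is hypothetical: the two server classes, their data and the rectangular town extents are stand-ins chosen to keep the example short.

```python
class TownServer:
    """Hypothetical server holding town extents as (xmin, ymin, xmax, ymax)."""

    def towns(self):
        return {"Ashby": (0, 0, 10, 10), "Brook": (20, 0, 30, 10), "Calder": (40, 0, 50, 10)}


class StationServer:
    """Hypothetical server holding railway station locations as (x, y)."""

    def stations(self):
        return {"Ashby Central": (5, 5), "Calder Halt": (45, 2), "Moor Junction": (100, 100)}


def contains(extent, point):
    # Point-in-rectangle test performed by the client, not by either server.
    xmin, ymin, xmax, ymax = extent
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax


def towns_with_stations(town_server, station_server):
    # The client fetches from both servers and performs the join itself.
    stations = list(station_server.stations().values())
    return sorted(name for name, extent in town_server.towns().items()
                  if any(contains(extent, location) for location in stations))


result = towns_with_stations(TownServer(), StationServer())
```

Neither server knows the other exists. The client pulls both data sets across and supplies the containment test, which is the cost of interoperability at the client.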

Most commercial solutions today implement ‘interoperability at the client’. 
Although there are a number of products that automate the coordination of multiple
servers, most of these derive from the requirement to manage transactions across 
distributed servers. 
4.2 Interoperability between Servers

The example query described above required the coordination of multiple servers 
to obtain a result. One might imagine a ‘super server’ designed specifically to 
coordinate requests amongst multiple servers, without any need to store data of 
its own. A client of such a server would be relieved of many of its tasks. However 
truly distributed servers need to be ‘aware’ of one another in the absence of a 
coordinating client. A good example is the handling of feature relationships. 

For the purposes of this discussion, a feature relationship is a pairing of 
pointers that allows two features to be ‘related’. While the features involved live 
on the same server this can be achieved using techniques specific to the imple- 
mentation on the server. However, if there is a need to establish a relationship 
between features on different servers, it is necessary for both servers to share 
some mechanism for feature identification[14]. Frequently the solution involves 
an addressing model in which a feature identifier is expanded into a sequence of 
‘scope identifiers’ and identifiers within those scopes. Systems employing such 
an approach must also be able to deal with the distinction between a feature not 
existing and the server on which it lives being unavailable. 
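
A minimal Python sketch of such an addressing model follows. The identifier syntax (`scope/scope/local-id`), the `Registry` class and both exception names are hypothetical; they only show how a resolver can keep "the feature does not exist" apart from "the owning server cannot be reached".

```python
from dataclasses import dataclass


class ServerUnavailable(Exception):
    """The server that owns the scope could not be reached."""


class FeatureNotFound(Exception):
    """The server was reached but holds no such feature."""


@dataclass(frozen=True)
class FeatureId:
    scopes: tuple   # scope identifiers, outermost first, e.g. ("org-a", "roads")
    local_id: str   # identifier within the innermost scope

    @classmethod
    def parse(cls, text):
        *scopes, local_id = text.split("/")
        return cls(tuple(scopes), local_id)


class Registry:
    """Hypothetical directory mapping a scope path to a server's feature store."""

    def __init__(self):
        self._servers = {}

    def register(self, scopes, store, online=True):
        self._servers[tuple(scopes)] = (store, online)

    def resolve(self, fid):
        # An unregistered scope is treated as unavailable in this sketch.
        if fid.scopes not in self._servers:
            raise ServerUnavailable(f"no server known for scope {fid.scopes}")
        store, online = self._servers[fid.scopes]
        if not online:
            raise ServerUnavailable(f"server for scope {fid.scopes} is offline")
        if fid.local_id not in store:
            raise FeatureNotFound(f"{fid.local_id} not in scope {fid.scopes}")
        return store[fid.local_id]


registry = Registry()
registry.register(("org-a", "roads"), {"17": "Main Street"})
registry.register(("org-b", "rail"), {"3": "Central Station"}, online=False)

print(registry.resolve(FeatureId.parse("org-a/roads/17")))  # Main Street
```

A caller that catches `ServerUnavailable` can retry later, whereas `FeatureNotFound` signals that a stored relationship is dangling.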



4.3 Interoperability between DCPs 

The desire to see a generic solution to the problem of providing interoperability 
between different DCPs, most notably COM and CORBA, is one that faces any 
vendor that cannot afford to rely on a single platform. As clear water opens 
up between the evolving Java platform and the place for Java in the Windows 
environment, it is difficult to imagine that this is not a problem that will afflict 
increasingly many people. Because the OGC is platform neutral, it is a problem 
that they face also. Clearly it is beyond the scope of the OGC to define a solution, 
but they monitor the situation and inform their members. 



5 Conclusions 

The OGC has set itself the task of developing specifications that it can bring to 
the market place with the promise that the GIS vendor community can deliver 
against those specifications. By removing the risks of vendor specific solutions 
and elevating the level of interaction between software components, it hopes to 
make the task of exploiting the geospatial data (that many organizations have 
access to) both easier and more common. In setting itself the dual challenges of 
providing technically sound specifications in a commercially realistic manner, it 
lays itself open to tensions readily identified with by GIS vendors. Consequently 
the OpenGIS Consortium provides a rare forum for identifying, discussing and, 
where possible, resolving these issues. This paper has sought to give an indication 
of the range and importance of these issues. 

Abstract. Combination of data and services (i.e. ’interoperation’) is 
one of the key concerns in developing a geographical information business 
network, in particular for near real-time information derived from 
satellite-based earth observation. In this respect the current situation of 
the satellite ground segments is not satisfactory and will require a 
conceptual improvement. Various international standardization and 
harmonization activities, like CCSDS OAIS, CEOS CIP, FGDC GEO, OGC 
OpenGIS and ISO/TC 211, are identified as the basis for interoperability, 
but will need to evolve into an ’integrated application infrastructure’ that 
organizes service interoperation, thereby building production chains for 
specialized high-value application products. Within the area of application 
programmes the European Space Agency has a vital interest in supporting 
this convergence and is investing in research projects to help European 
industry participate in this process. 



1 Introduction 

The world of geospatial information is currently undergoing a rapid and profound 
evolution. There is a perception that the time is ripe for the development of a 
new industrial branch, with the opportunity to further develop the public sector 
and, in parallel, create a profitable private sector, allowing beneficial cost 
sharing for both. 

For three out of the five main technologies involved - geographical informa- 
tion systems, remote sensing, geopositioning, telecommunications and distribu- 
ted computing - the space sector can contribute to effective solutions. Therefore 
the European Space Agency has a natural interest to explore the technological 
needs for the deployment of this sector. 

This is in line with a general re-orientation of the Agency’s work: while 
technology research has to remain the basis, there is the wish to enhance the 
Agency’s efforts towards a more application and industry-oriented approach, 
better exploring and preparing the potential commercial exploitation of 
technology investments. This is particularly true for the sector of remote sensing, 
where an increase of attention towards the actual exploitation of data and their 
use for applications - as opposed to a genuine technological focus on the space 
segment - will be required. 

A. Vckovski, K.E. Brassel, and H.-J. Schek (Eds.): INTEROP’99, LNCS 1580, pp. 29-40, 1999. 
© Springer-Verlag Berlin Heidelberg 1999 
G. Landgraf 

Application-oriented data exploitation is characterized by the need to 
combine data from different sources, possibly involving services of many different 
providers. By definition this requires ’interoperability’ in its largest sense, i.e. 
the technical possibility to combine data and services from different sources into 
’something new’ (which can be a new product or a new service). In the context 
of the present paper service interoperability is extended to any function 
potentially required for data exploitation. This starts from resource identification 
or ’advertisement’ (directory), includes services that are a traditional focus for 
interoperability activities like the catalogue (including metadata search and 
ordering), and extends to new functions to be standardized in the area of archive 
access, formatting and processing services. 

Traditionally interoperability is achieved by ’translators’, implemented (and 
maintained) with more or less effort, which map between different models 
with more or less loss of information. This is a reasonable approach if the number 
of data and service providers is small. If this number is too high, the number 
of necessary translators grows roughly with its square (one translator for every 
pair of models), and the approach collapses. 

The alternative solution is interoperability by ’standardization’. All players 
commit to offering their data and services according to a standard, either by 
implementing it directly or by translating to it. A federation of such providers, 
able to exchange data and services according to a well-defined standard, 
forms a de facto ’integrated interoperable infrastructure’, which can exchange 
data and services in a performant and cost-efficient way. This is the basis for 
the development of an ’Integrated Application Infrastructure’, which organizes 
the ’interoperation’ of this federation by implementing the application-specific 
workflows on top. 
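
The scaling argument behind the two approaches can be made concrete with a short calculation. The sketch assumes one translator for each ordered pair of provider models in the direct approach, and one translator to and one from the shared standard per provider in the standardized approach; both counting rules are simplifying assumptions for illustration.

```python
def pairwise_translators(n):
    """Direct translation: one translator per ordered pair of distinct models."""
    return n * (n - 1)


def standard_adapters(n):
    """Standardization: each provider translates to and from one shared standard."""
    return 2 * n


for n in (3, 10, 50):
    print(n, pairwise_translators(n), standard_adapters(n))
# 3 6 6
# 10 90 20
# 50 2450 100
```

With three providers the two approaches cost the same; beyond that the pairwise count grows with the square of the number of providers, while the standardized count grows linearly.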

2 The Current Situation and Its Limitations 

In the past, civil Earth Observation satellites were strongly technology-driven 
programmes, with no expectation that the high investment for the development, 
construction and launch as well as the cost of operation of the satellite could ever 
be directly recovered. As regards data exploitation, ESA budgets for 
production activities for the supported satellites (ERS, Landsat, JERS, NOAA, 
SeaStar, IRS, Nimbus) were limited to correction and geocoding of the 
acquired images, without provision for efficient coordination of further 
application-oriented exploitation. 

Furthermore development budgets were allocated with a focus on the single 
satellite mission, enclosing each satellite programme in a world of its own, 
with its proprietary terminology, data representation and services. This was 
acceptable for the scientific exploitation of the carried instruments, but prevented 
the propagation of these powerful information sources to really operational use 
in applications. Scientists are used to adapting to specific terminology, formats 
and access mechanisms and to developing temporary workarounds for their specific 
experiments. Typically an experiment or research activity is carried out once or 
for a limited time period, so there is no permanent load of additional cost from 
difficult access to data or the high effort for combining them. 

This is different for the applications sector. The additional time required to 
access and merge data with different access methods and heterogeneous formats 
can considerably reduce the value of a product for applications where real-time 
or near real-time information is needed. Furthermore any “overhead” cost for 
building, maintaining and operating translators may decide whether a 
“product” can be offered at a price that is acceptable to the market. 

It is therefore evident that an innovative approach is needed to take the 
step from the current situation of ’satellite-focused’ programmes towards a 
fertile environment for the growth of new applications and business sectors. 



3 The Need for an Interoperable Production 
Infrastructure for Spatial Applications 

A long production chain needs to be organized to arrive from the raw material 
(“single source products” like satellite observations as provided by the satellite 
operators) to the application service provider, who can finally offer the extracted 
application-specific “product” - rich in domain-specific information but small in 
size - in a suitable way for end users. It will not be sufficient to extract this 
information from a single-source product, but it has to be combined with other 
geospatial information, e.g. other satellite, airborne or ground-based observati- 
ons, maps and GIS data. 



Fig. 1. A ’typical’ application production chain to deliver Earth Observation 
information to an end user, integrating it with map and GIS data. The required data 
and services are attached to the interoperable backbone, allowing quick and 
standardized access. The governing ’active object’ which organizes the 
interoperation workflow is not shown. 

There are many ’production steps’ involved in achieving a final application 
product, usually falling into one of the following categories: 

Data Selection: Given the enormous amount of data acquired, rapid analysis 
of whether a new acquisition contains information useful for a specific domain 
(e.g. an oil slick, an iceberg, a fire, etc.) is an essential activity to prioritize 
the processing of commercially valuable information. New efficient and very 
specialized algorithms will be required to perform this step. 

Data Integration: Merging of different data - usually from different sources - 
is an important step towards a “semi-finished” product which then requires 
only a simpler and more rapid step of extraction of all relevant information. 
This activity requires particular skill and familiarity with a multiplicity of 
data characteristics. It will be impossible to deploy such a service in a cost- 
efficient way, if lower-level providers furnish their data according to ’local’ 
standards, implying the costly need to adapt the application to every single 
country, country-internal area and even township. The lack of interopera- 
bility - and consequently of interoperation - is probably one of the most 
prominent reasons why a ’data integration industry’ barely exists today. Lo- 
cal markets are too small to guarantee satisfactory return on investment, and 
for the global market the cost and time required for all necessary adaptations 
is prohibitive. 

Domain-specific Data Processing and Information Extraction: The step of 
reducing the data to the aspects relevant for an application domain 
is relatively simple, but requires a combination of domain-specific knowledge 
and expertise in data contents and format. 

Data Analysis: In many cases the end user will no longer be interested 
in an image representation, but only in extracted summary information 
(e.g. a crop forecast). This kind of analysis usually requires extensive domain- and 
remote-sensing-specific expertise and highly specialized algorithms. 

Data Representation: The final step to be executed by the specialized appli- 
cation service provider is to assemble all relevant information and convert 
it into a format suitable for the specific application user. Depending on 
the domain, this can be an electronic map, an email message, a fax, a data- 
base update, etc. Again, this potential business sector faces problems similar 
to the data integrators. High-level application products serve very specific 
needs and therefore have a limited end user community. For this reason the 
initial investment to develop the service needs to be limited by a standardi- 
zed interoperable infrastructure, with this latter one becoming a mandatory 
prerequisite for profitable deployment of this business sector. 

Depending on the application domain all these steps will be executed by diffe- 
rent constellations of service and value-added providers, with different workflows 
putting the various steps above in sequence. Typically many of these steps will be 
repeated to achieve the final product after passing various levels of intermediate 
products. 
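
The idea of a workflow that puts production steps in sequence can be sketched as plain function composition. The step functions, the record fields and the oil-spill data below are all invented for illustration; a real production chain would call distributed services instead of local functions.

```python
from functools import reduce


def select(product):
    # Data Selection: keep only acquisitions flagged as relevant.
    return [scene for scene in product if scene["relevant"]]


def integrate(product):
    # Data Integration: attach auxiliary (e.g. map) data to every scene.
    return [{**scene, "map_layer": "coastline"} for scene in product]


def extract(product):
    # Information Extraction: reduce each scene to the domain-specific facts.
    return [{"id": scene["id"], "event": scene["event"]} for scene in product]


def represent(product):
    # Data Representation: format the result for the end user.
    return [f"{item['event']} detected in scene {item['id']}" for item in product]


def run_workflow(steps, raw_product):
    """Apply the steps in order; the output of each step feeds the next."""
    return reduce(lambda product, step: step(product), steps, raw_product)


acquisitions = [
    {"id": "A1", "relevant": True, "event": "oil slick"},
    {"id": "A2", "relevant": False, "event": "none"},
]
oil_spill_workflow = [select, integrate, extract, represent]

print(run_workflow(oil_spill_workflow, acquisitions))
# ['oil slick detected in scene A1']
```

Because a workflow is just an ordered list of steps, a different application domain can reuse the same steps in another order or repeat some of them.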

We must not have overly high expectations of the price that a broad consumer 
market is willing to pay for a ’piece’ of geospatial information. Therefore it is 
important that the cost of initial investment and reproduction for the 
value-added service industry is as limited as possible, still allowing an attractive 
profit margin. Where the geospatial information has a high cost of initial 
investment (e.g. satellite images) the price can only be kept down if this investment 
can be distributed over a large quantity of sold products - a goal which is only 
achievable if it is used by a vast number of end users in many different 
applications. The ’high price’ policy applied in the past has fallen far short of 
covering the initial expenditure, and today we can already observe a decrease in 
the price at which remote sensing products are commercially offered. 

The big advantage of the geospatial industry is that all ’raw materials’ can 
be made available in digital electronic format, so we don’t need highways and 
trucks to ship them from one participant to the other, but ’only’ a network with 
sufficient bandwidth. Consequently the time to ’ship’ the ’assembly components’ 
between the ’factories’ can be reduced to a negligible factor. Making full use of 
state-of-the-art technologies in distributed computing and telecommunication a 
geospatial production chain can be built in accordance with the ’virtual enter- 
prise’ paradigm, cutting production costs by an order of magnitude with respect 
to traditional approaches. 

What is really essential is to have an interoperable network that allows access 
to data and services for their manipulation at sustainable cost. This basic in- 
frastructure has to consist of enhanced and cheap network capabilities, together 
with an EO/GIS application standard protocol. With the availability of such an 
interoperable backbone, each ’provider’ in the geospatial sector can ’attach’ his 
data and/or services at marginal costs, consequently obtaining the relevant profit 
derived from their usage. Only the availability and easy accessibility of products 
of a certain type and level will enable “business” for value-added providers of 
even more specialized higher-level products, targeted to specific application use. 

As regards the network part, there are other sectors that will drive 
the development. The multimedia market including digital TV also requires this 
kind of infrastructure and the market forces behind this domain are an order 
of magnitude bigger than ours. It can also be expected that some public 
infrastructure investment will occur to accelerate the new markets. The EO/GIS 
application standard protocol on the other hand is specific to our domain 
and we have to invest in it ourselves. 

As the final vision of a ’growing’ infrastructure, for each single application 
domain it can be expected that the availability and easy accessibility of 
specific subsets of data and services will be the ’critical mass’ for the commercial 
deployment of this application - which is not only a consumer of other 
geospatial services and data, but is itself a new service that can be attached 
to the network (potentially being part of the ’critical mass’ for even newer 
services that cannot be imagined today). Applying this recursively, a production 
infrastructure for spatial applications can dynamically build up an ’information 
value tower’, providing more and more suitable products for an ever-growing 
community of end-users. 

4 Ongoing Interoperability Activities and Trends 

Many different bodies - with different focus and level of abstraction - are 
currently working in parallel on the subject of interoperability. The ones considered 
hereafter are only the subset that is currently at least monitored in some way 
by the European Space Agency. The present contribution therefore does not 
claim to be complete. 



4.1 CCSDS 

The Consultative Committee for Space Data Systems (CCSDS) is an organiza- 
tion officially established by the member space Agencies. It meets periodically to 
address data system problems common to all participants and formulate sound 
technical solutions. 

Of particular interest is the reference model for an Open Archival Information 
System (OAIS) [9], developed in response to ISO TC20/SC 13. 

The OAIS model contains an in-depth analysis of data, metadata and servi- 
ces, including archival, metadata query, ordering and retrieval. 



4.2 CEOS Activities 

CEOS - the ’Committee for Earth Observation Satellites’ [8] - is an international 
platform to discuss issues related to satellite-based remote sensing. Its “Working 
Group Information Systems and Services” (WGISS) is structured into various 
subgroups, which achieve their technical work through task teams. Of particular 
interest are the activities of the ’Access’ subgroup. 



The CINTEX Task Team and IMS. The ’Catalogue INTeroperability Ex- 
periment’ (CINTEX) is a historical milestone for gaining experience on the issue 
of interoperable catalogue access. This activity was mainly sponsored by NASA, 
who contributed the IMS client and the IMS gateway, which has been customi- 
zed by various satellite data providers. The IMS network implements the second 
level of interoperable federation identified in the CCSDS OAIS model, i.e. in- 
teroperable query and metadata retrieval via a global node which can distribute 
a query to multiple local archives. The IMS client basically allows formulation 
of such queries and visualization of metadata information by a single client. 
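
The role of the global node can be illustrated with a small fan-out sketch. The `LocalArchive` and `GlobalNode` classes, the `search` method and the record fields are hypothetical and do not reproduce the IMS interfaces; they only show one query being distributed to several archives and the metadata being merged for a single client.

```python
class LocalArchive:
    """Hypothetical local archive holding metadata records."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    def search(self, sensor):
        return [record for record in self._records if record["sensor"] == sensor]


class GlobalNode:
    """Forwards one query to every registered archive and merges the metadata."""

    def __init__(self, archives):
        self._archives = list(archives)

    def search(self, sensor):
        results = []
        for archive in self._archives:
            for record in archive.search(sensor):
                # Tag each hit with its origin so the client can tell them apart.
                results.append({**record, "archive": archive.name})
        return results


node = GlobalNode([
    LocalArchive("archive-a", [{"id": "a1", "sensor": "SAR"},
                               {"id": "a2", "sensor": "TM"}]),
    LocalArchive("archive-b", [{"id": "b1", "sensor": "SAR"}]),
])

print([record["id"] for record in node.search("SAR")])  # ['a1', 'b1']
```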

While the level of syntactic interoperability is satisfactory and IMS is 
continuing to be enhanced, CEOS decided that the level of standardization should 
be improved by a more formally specified protocol to better exploit information 
derived from different sources. This would implement the Interoperable 
Catalogue System (ICS) as the third and fully functional level of an interoperable 
OAIS federation, i.e. including standard ordering and dissemination mechanisms. 

The Protocol Task Team and CIP. The ’Catalogue Interoperable Protocol’ 
[1] further standardizes metadata and services related to interoperable access 
to satellite-based remote sensing data. Its initial version (CIP-A) was mainly 
developed by the European Space Agency, whereas the European Union and 
NASA are the prime technical contributors to the subsequent CIP-B version. As 
its main innovation CIP introduced the concept of structuring data into 
hierarchical ’collections’ with a two-level search approach - first for the collections 
of potential interest and then inside these collections. Furthermore, interoperable 
ordering with the related problem of user management and authentication was 
addressed, and an abstract model for the specification of order options was agreed. 
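
The two-level search over hierarchical collections can be sketched as follows. This is not the CIP message format or a Z39.50 profile; the `Collection` class, the keyword matching and the sample records are assumptions chosen to show the two stages.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    name: str
    keywords: set
    items: list = field(default_factory=list)      # product metadata records
    children: list = field(default_factory=list)   # nested sub-collections


def find_collections(root, keyword):
    """Level 1: walk the hierarchy and return collections matching a keyword."""
    matches = [root] if keyword in root.keywords else []
    for child in root.children:
        matches.extend(find_collections(child, keyword))
    return matches


def search_items(collections, predicate):
    """Level 2: search only inside the collections selected at level 1."""
    return [item for c in collections for item in c.items if predicate(item)]


ers_sar = Collection("ERS SAR", {"radar", "ocean"},
                     items=[{"id": "E1", "year": 1997}, {"id": "E2", "year": 1998}])
landsat_tm = Collection("Landsat TM", {"optical", "land"},
                        items=[{"id": "L1", "year": 1998}])
root = Collection("All missions", set(), children=[ers_sar, landsat_tm])

radar = find_collections(root, "radar")
hits = search_items(radar, lambda item: item["year"] == 1998)
print([c.name for c in radar], [h["id"] for h in hits])  # ['ERS SAR'] ['E2']
```

The first stage prunes the search space cheaply, so the item-level query never touches the Landsat collection even though it also holds a 1998 record.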

As for the technical foundation, it was decided to implement CIP as an 
application-specific profile of Z39.50. This had the big benefit of being 
able to inherit existing standards and software for search and retrieval, and 
allowed harmonization with the FGDC standard for Digital Geospatial Metadata 
[3], which had been defined for the geographic community. 

The drawback is the difficulty of extending a protocol like Z39.50 - focused 
on search and retrieval - to other services in a natural way. These inherent 
limitations led ESA to the conviction that the next generation of a CEOS 
interoperable protocol should be based on CORBA, which has meanwhile become 
an acceptably mature standard. 

The protocol task team is closely observing the progress of ISO/TC 211 [5], 
with the aim of unifying the CIP and Geospatial profiles even better on this basis. 
On the other hand there is considerable interest in the OpenGIS activities, which 
has led to a proposal to the OpenGIS Consortium (OGC) for the “Catalogue 
Services Request for Proposal”. The building of this proposal, which maps the 
CIP services to XML, was a mainly NASA-driven initiative and aims to put the 
rich experience of the CEOS Protocol Task Team at the disposal of the OGC. 

Apart from the various specifications the following CIP software is currently 
becoming available: 

— the INFEO middleware produced by the European Union within the CEO 
programme; 

— the CIP-ODBC gateway produced by the European Union, allowing to con- 
nect local databases to the CIP federation network; 

— the CIP client produced by the European Space Agency [6]. 



4.3 The FGDC GEO Profile 

The U.S. Federal Geographic Data Committee has developed a Z39.50-based 
interoperable protocol standard for Digital Geospatial Metadata (“GEO”) [3] in 
parallel to the remote sensing community. A registry of GEO databases is kept 
at the FGDC and is known as the Clearinghouse. This standard is also 
successfully used in the ’CEOnet’ (Canadian Earth Observation Network) where it is 
implemented by the ISITE software, a public domain Z39.50 software package 
supporting the FGDC metadata standard. 

However, the FGDC also seems to have felt that more extensive services required 
a different base protocol, and it has prominently sponsored OpenGIS 
demonstration activities. 



4.4 The OpenGIS Consortium 

While all other initiatives are government-driven, the OpenGIS Consortium 
(OGC) [4] is an industry-driven trade association. From the achievements of 
the last years it can be concluded that the level of effort put into this 
standardization activity is an order of magnitude greater than in the others. The 
correctness of the initial focus on a CORBA-based infrastructure has been 
confirmed by the growing maturity and acceptance of this distributed computing 
solution, even if OpenGIS has elevated its abstraction level to be compatible also 
with other distributed computing platforms such as DCE and DCOM and languages 
like SQL and Java. However, excessive opening could turn out to be expensive and 
dangerous for the standardization effort and will require strong control by the 
OGC Technical Committee to avoid counterproductive ’pushing’ by members of 
their own base technologies. 

Extrapolating the current trend, OpenGIS has to be judged as a very promi- 
sing platform to specify the baseline of an interoperable EO/GIS infrastructure. 

4.5 ISO/TC 211 

TC 211 [5] is the Technical Committee tasked by ISO to prepare a set of 
standards related to geographic information. There is a common line in all other 
standardization efforts to modify any current standard along the lines given by 
ISO. However, the ISO focus is rather on data, while the other standards also 
cover practical implementation aspects. 

5 Activities of the European Space Agency 

As indicated earlier, the European Space Agency is placing more emphasis on 
application-targeted activities, as opposed to the past mostly science-related 
approach. For the development and deployment of applications two main ground 
segment activity lines need to be envisaged: 

— The investment in research activities in the application domains themselves, 
i.e. special processing, algorithms, etc. 

— The creation of a basic infrastructure that can be reused by multiple appli- 
cations. 

This latter infrastructure will profit enormously from successful international 
standardization efforts and it can be anticipated that it would not even be 
possible without them. Data and service providers are globally distributed, and many 
of them are under national competence. In the European scenario in particular, 
standardization is a must for the backbone infrastructure required for the deployment 
of extensive application-oriented data exploitation. Apart from the participation 
in international activities, many pilot projects have been initiated to unify the 
services provided for the single satellites into an interoperable framework: 

— The CIP client [6] serves as demonstration and test tool for the CIP infra- 
structure on one hand and for the potential of an interoperable custom client 
capable of locally manipulating the retrieved data on the other hand. The 
CIP client has been complete since December ’98 and is available via the Web. 

— The Multi-Mission User Services (MUIS) project is developing a multi- 
mission distributed infrastructure, providing access for all ESA supported 
Earth Observation missions with a unified service concept including pro- 
duct and sensor guide, service directory, inventory, browse, on-line ordering, 
archive access (including post-processing). The initial release MUIS-A 
provides ESA’s “earthnet online” service [7], which can be accessed at 
“earthnet.esrin.esa.it”. Initially it was planned to use the CIP protocol as internal 
standard, but the speed of the international standardization activity could 
not meet the requirements of the project. Thus, a MUIS abstract model and 
interoperable protocol (CIP) was developed, and the results were partly fed 
into the CIP activity. 

— CICCIA (Catalogue Interoperability through CORBA compliant Infrastruc- 
tures and Architectures) will analyze current specifications in the area of 
interoperable catalogues and define an implementable CORBA-based 
specification. This activity has to be seen in support of the MAAT activity, 
but will also be the “workhorse” for Agency participation in 
CORBA-infrastructure-oriented standardization activities. 

While the above projects are targeted towards the provision of interoperable 
services, other pilot projects have been undertaken to implement the “interope- 
ration” of these services with an application-oriented view: 

— ISIS (Interactive Satellite Image Server) has implemented a first prototype 
of a possible application infrastructure, focussing on oil-spill detection. 

— The PATHEMA project implements a CORBA-based demonstrator system 
for generating multilayer thematic products and will generate detailed re- 
quirements on functions for data archiving, access and manipulation. 

— RAMSES (Regional earth observation Applications for Mediterranean Sea 
Emergency Surveillance) is a major experimental project co-financed by the 
European Commission and industry, involving multiple countries in the 
Mediterranean basin. It will provide a CORBA-based demonstration 
infrastructure for an interoperable production chain, resulting in a major 
acquisition of practical experience with the detailed problems of setting 
up an integrated application infrastructure in practice. 

— The MAAT (Middleware Architecture for distributed earth observation Ap- 
plication Systems and Tools) project will analyze the end-to-end architecture 
of a CORBA-based application infrastructure and will specify the required 
objects and their interfaces. This study will essentially try to capture all 
experiences gained from earlier prototypes, the MUIS CIP protocol, and the 
CIP activity, aligning them with the OpenGIS concepts. As a result a 
detailed abstract specification for an interoperable production infrastructure for 
spatial applications is expected. 

6 Conclusion 

There is good reason for optimism that the ’critical mass’ for the deployment of 
a new large business area has been reached. However, we have to clearly understand 
that we are still in a pioneering phase. We are making big steps forward, but 
on the more detailed scale there are big problems waiting to be resolved and 
sometimes we will even need to take a step back. There are areas where much 
work has been done - e.g. for all services related to catalogues - while other 
essential ones are only at the beginning of being resolved. These include e.g.: 

— standardization at data level, e.g. cross-calibration just to mention one very 
basic problem related to remote sensing; 

— data access methods including reformatting; 

— data processing functions. 

For a long time we will still have to live with ’islands of interoperability’, 
small areas where the vision already works, while other areas will still have to 
undergo a conceptual re-thinking due to new facts and problems popping up. It 
will be important to maintain the current strong and positive collaboration of 
all involved bodies to shorten the construction period by all pulling together 
in the same direction. 

Abstract. Finding the “right” geographic feature is a common source of 
interoperability difficulties. This paper reviews the issues and discusses 
how persistent feature identifiers can be used to support relationships and 
incremental updating in dispersed inter-operating information systems. 
Using such identifiers requires common definitions for concepts such as 
“scope” of datasets and identifier namespaces. This work extends current 
understanding in the Features Special Interest Group of the Open GIS 
Consortium (OGC). 



1 Introduction 

Compared with the requirements of an individual analyst, operational use of 
geographic information in a multi-user, multi-organisation application adds 
significant new requirements in data maintenance, data transformation, lineage 
tracking, schema maintenance and metadata update [1,2]. 

These additional functions tend to involve several datasets with specific re- 
lationships, e.g. one dataset may be a prior version of another. We then require 
some way of tracking individual features across those datasets. The standard ex- 
ample is where some attributes of a feature are under the update authority of a 
different organisation from other attributes; but somehow all parties must agree 
that they are referring to the correct feature even if many (or all) of the attribute 
values change. 

A study of feature identity presupposes that we will be dealing with 
moderately persistent real world objects which are observable as distinct entities (at 
least for a while): entities that exist long enough to be worth naming and talking 
about. Thus this paper is firmly placed in the “object” rather than the “field” 
tradition of GIS, with the proviso that some of these objects may have indistinct 
boundaries [3] and may be temporary, e.g. sandbanks, storms and forest fires. 

This paper attempts to review conceptual structures which may underpin 
future interoperability standards. File data formats have a relatively short useful 
life compared to the life of the data they transport and “standard” function 
interfaces have even shorter lives, but the data model has a much longer life: 
almost as long as that of the data itself. 

2 Background

2.1 Unfulfilled Needs 

Current treatments of feature identity do not adequately support the common user
needs of incremental publishing (serial updates) and value-adding, where a user
of a spatial dataset wishes to add more information to a published dataset and
yet retain synchronisation when that data is updated [1,2].



2.2 Conceptual Data Model 

It is necessary to outline a conceptual data model in which to frame the spaces 
and domains under discussion. The following is a simplified version of the Open 
GIS Consortium (OGC) conceptual data model [4]:

Real World: the entire world in objective reality 
Conceptual World: the observed subset of the real world 
Geospatial World: a categorisation and classification of that subset 
Dimensional World: the classified entities with metric representations and 
spatial reference systems, but not yet represented in any software system 
Project World: the entities in a logical schema defined by a particular infor- 
mation community 

Software Worlds: a set of representations of the entities in an overlapping set 
of increasingly capable software systems with defined schemas 

In this paper a “feature” will be taken to be a software representation of a 
real-world object [4], e.g. a lake, road or city, which can have associated with 
it a number of attributes, some of which are geometric representations (“geo- 
metries”), i.e. shapes with locations. (Note that this definition differs from a 
commonly understood meaning where a feature is the geometric representation 
in a spatial reference system.) Thus a school is represented as a feature with one 
associated complex geometry which is the set of polygons representing the floor 
plans of the buildings and another that is the boundary of the site. 

If we examine the conceptual data model sketched out above, we will see 
that unique labelling can only be done for discrete objects which are already a 
categorisation of a subset of the real world. 



2.3 Labelling the Real World 

The first hurdle to overcome is the disbelief of those who think that suggesting 
the use of feature identifiers means that we must label and index everything in 
the real world. That is clearly infeasible and for almost all purposes we cannot 
assume that such an index exists. 

However there are organisations which do maintain unique identifiers for a 
great many types of real world objects, e.g. road bridges are numbered within 
the jurisdiction of a local government’s civil works department, telegraph poles 
are labelled by the local telecommunications organisation, and every computer
Ethernet card has a unique number burned into it during manufacture (we know
they are unique even if we don’t know where they are). These labelled real world 
objects are almost invariably man-made or even entirely man-imagined, e.g. land 
parcel identifications, for the simple reason that natural objects usually permit 
less precise delineation, e.g. is an estuary part of the coast or the riverbank? 
This indeterminacy is of several distinct types and has been discussed in detail 
in a recent conference [3].

The types of feature identity that will be introduced later in this paper must 
be able to interwork with these pre-existing labelling schemes that are maintai- 
ned by a variety of organisations with very different construction grammars and 
quality control standards. A key point is that the labels have to be maintained: 
mistakes happen and must be corrected, incorrect numbers are applied to real 
objects and real labels are misrecorded in software systems. Thus the real world 
label must be related to, but distinct from, any geographic information system’s 
feature identity. For these reasons it is clear that real world labels cannot,
for some purposes, substitute completely for that feature identity, even though
such labels probably provide more added value than other types of feature identity.

2.4 Practicalities 

Practical interoperability has to take account of pre-existing data which may 
have been constructed using entirely different semantic principles. In the case of 
feature identity, the task in hand is not so much to provide support for those exi- 
sting dataset collections which provide sophisticated persistent identifiers as to 
provide mechanisms whereby datasets without persistent identifiers can nevert- 
heless offer useful services as if they did. For example, a good standard should 
not preclude the possibility that identity is constructed using a key based on 
one or more attribute values (as used in many RDBMSs), but neither should 
it mandate such an approach since many other GISs have a concept of identity 
which cannot be adequately represented by that approach. 

The process of standardisation introduces its own oddities and restrictions 
which are not present in either commercial or academic research software. In 
addition, the goals themselves are different from most academic work in GIS. 
University research tends to look for proof that an innovative technique is feasi- 
ble, elegant and efficient, irrespective of its compatibility with existing systems, 
whereas standardization research has to bring the bulk of the existing commer- 
cial implementations with it. Thus introducing new concepts has to be done with 
great care: 

— find the minimum number of new concepts required 

— consider whether these concepts need to be reified as software objects; 

— if so reified, whether these objects should be named, and 

— if so named, over what namespace the names should be unique. 

A “namespace” is a concept which has been found increasingly useful in the
design of computer languages. The C++ language recently elevated it to be an
integral part of the language, and the Python language is fundamentally designed
around it [5].

Thus while a concept may be universally agreed as being a useful aid in
structuring a problem area, it can nevertheless be productive to avoid naming
or reifying it because that can then avoid an entire harmonisation argument.

For those objects (concepts) we are going to name individually we then con- 
sider how we might “get one” and what we could do with it once we’ve got it, 
i.e. what other object might be a factory for it, and what operations we might 
want to perform on it or with it. 



Example The technique of reducing conflict by avoiding the reification of a
concept is demonstrated by the OGC concept of “Feature Type”.

The OGC abstract specification defines the concept of feature type which 
specifies the attributes (properties) that a feature of that type can have, but 
the OGC Simple Features for SQL specification avoids making that concept 
into a manipulatable object and instead each individual feature can be queried 
as to what attributes it supports. This avoids introducing a new type into the 
function interface and avoids introducing a new namespace, the list of names of 
feature types, with its own uniqueness constraints. The savings in elapsed time
to produce the standard and to test proposed implementations for conformance
more than make up for the slight inconvenience of not having a standard for
feature type itself (which could be introduced into later standards if absolutely
required).



2.5 Dataset 

A “dataset” has an obvious meaning when a simple GIS organises its persistent 
storage as files. However, large and complex GIS applications involve databases 
and possibly a large number of files for import, distributed update, etc. Thus we 
need a tighter definition or a different concept entirely. 

The OGC uses the term “Feature Collection” to mean an object which
represents a collection of features but which also may support its own attributes.
The fundamental operation on a feature collection is to make available (in some
way) the features in the collection. Feature collections may be permanent, e.g.
a dataset, or transient, e.g. the result of a query. The metadata of a dataset thus
become attributes of a feature collection. Compatibility with other standards
for metadata content can then cause problems because some which come from
bibliographic communities allow “repeated fields” with different data, whereas
most GIS data models for features (and an OGC feature collection is a variety
of feature) allow only “name = value” semantics.

The existing OGC “Simple Features” specifications do not need to go into
details such as defining the feature type of a feature collection [6] but the OGC
Abstract Specification suggests that any feature could belong to several feature 
collections at once. 

2.6 Schema

Many of the queries that one wishes to perform on a dataset logically should be 
queries addressed to the dataset’s schema, e.g. questions as to what feature types 
the dataset holds, what attribute names and types are defined for each feature 
type, etc. If we consider a set of related datasets, e.g. a set of versions, then 
we can see that the schema itself has a broader existence than each individual 
dataset and might be better identified with a “scope” (Sect. 3.2). 

Clearly if versions of the “same feature” may be found in several different
datasets, those datasets should have some of the schema in common (but
they may not necessarily have the same internal structure: versioning should be
able to cope with restructuring, e.g. of a directory tree [7]).

Consider for a moment the case where we have a real world feature
description. In this case the different feature representations of the real entity
may have nothing in common: for example, the London suburb “Richmond” appears
both as a feature in the London Tube map and as part of the UK postal code
coverage, but these have nothing in common and it is sensible to consider them as
two different features (software representations).

Initially it would be simpler to just assert that the schemas must be iden- 
tical in all datasets in the same scope where a feature identifier may be used. 
However we must bear in mind that serious geographic information applications 
are approaching 24 hour - 365 day operation, so some allowance for dynamic 
schema evolution will certainly be needed in the near future. 

2.7 Lineage and Metadata 

An important relationship between datasets which are intended to share feature 
identity scopes is “lineage”: the history of data. Lineage should be described in 
the metadata of a dataset: “the currency, accuracy, data content and attributes, 
sources, prices, coverage, and suitability for a particular use” [8]. 

For feature identity management we need something more precise than the 
rather loose semantics and grammar, defined only in natural language descripti- 
ons, which are commonly used in metadata descriptions. If a dataset is composed 
of several persistent feature collections then each could contain its own metadata, 
and in the limit of granularity, every individual feature could contain metadata 
on how it was constructed and under what conditions its attribute values were 
originally measured. 

There are geographic-specific metadata protocols and systems [9] as well as 
names and types [8], but the recent enormous growth in non-geographic meta- 
data protocols such as RDF [10] implies that the GIS-specific protocols will have 
a short life before they are merged into the mainstream. 
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3 How Many Varieties of Identity? 

3.1 Descriptors and Handles 

We have the concepts feature and dataset (feature collection). We think we need
the concept of feature identifier, but some thought will show that we need two
such concepts: a feature descriptor and a feature handle. A descriptor is some
way of specifying a feature from “outside the system”, by listing some sufficiently
unique combination of attribute values, where the list of attributes will be highly 
application dependent. If an external agency maintains real world labels, then a 
single attribute value may be sufficient: but from the software designer’s point 
of view the uniqueness of such labels cannot be entirely relied upon. A feature 
handle, however, is a concept that is “inside” the software system and which is 
required to have quite tightly defined uniqueness properties which are enforced 
by the software itself. 

Feature handles should generally be considered “opaque” and it might not
even be sensible to think of them as “values” at all. Some handles might be a
string of text which is a query (in any language) sufficient to retrieve the
feature itself.
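
The distinction can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class FeatureStore, its methods
add, get and find, and the "scope/serial" handle layout are hypothetical and
are not taken from any OGC specification.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FeatureStore:
    """Hypothetical in-memory store; not an OGC interface."""
    scope: str
    _features: dict = field(default_factory=dict)
    _serial: count = field(default_factory=count, repr=False)

    def add(self, **attributes) -> str:
        # The store, not the caller, issues the handle, so uniqueness
        # within this scope is enforced by the software itself.
        handle = f"{self.scope}/{next(self._serial)}"
        self._features[handle] = dict(attributes)
        return handle

    def get(self, handle: str) -> dict:
        # A handle evaluates to exactly one feature, or fails.
        return self._features[handle]

    def find(self, **descriptor) -> list:
        # A descriptor is a set of attribute values supplied from outside;
        # it may match zero, one or several features.
        return [h for h, attrs in self._features.items()
                if all(attrs.get(k) == v for k, v in descriptor.items())]


store = FeatureStore(scope="example-scope")
h1 = store.add(name="Richmond", kind="tube station")
h2 = store.add(name="Richmond", kind="postal district")

ambiguous = store.find(name="Richmond")                      # two handles
precise = store.find(name="Richmond", kind="tube station")   # one handle
```

Which attributes make a descriptor "sufficiently unique" remains an application
decision; the store only guarantees the uniqueness of the handles it issues.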

3.2 Scopes 

Scope is a dataset management and metadata issue. A scope is a unit within 
which updating management can occur and probably the level at which schemas 
are defined. A scope is a collection of software and data in which a feature handle
has meaning. If we are going to insist on uniqueness, we must define scope. 
There are two interpretations for scope, the first broader than the second: 

Meaning Within a scope, a feature handle has meaning. 

Reachability A scope is a “domain of reachability” in that a reference (a feature 
handle) from a feature to another feature can “reach” another, e.g. to imply a 
relationship. 

A scope to be used for update could be larger than a domain of reachability, 
i.e. a system may allow references from a feature to another only within a part 
of the scope in which that reference (feature handle) has meaning, but might 
allow corrections to be made to anything it knows about. (Existing examples are 
systems which allow topological relationships only between features in the same 
thematic layer.) 

Even at our limited current state of knowledge, we can probably suggest 
that our lives will be simpler if we propose that a feature handle has meaning 
only within one scope, and that the handle itself contains the means to uniquely 
identify that scope. 

How we identify scopes and how we define the function which evaluates a 
feature handle to produce (on demand) that feature or how we evaluate the 
handle to produce some kind of scope handle, is another matter. 

3.3 Universal Identifiers: URIs, Monikers, GUIDs etc.

The problem of unique object identifiers has been tackled by several other
software industries before geographical information users became interested or aware
of the issue. 

Existing distributed computing platforms all offer their own solutions to the 
problem, but the objects in these cases are responsive “live” software processes, 
not “dead” geographic features hidden inside proprietary software systems and 
not accessible to dynamic enquiries. These “live” objects include Microsoft COM 
objects, CORBA nameservices, Inter-Language Unification (ILU) “String Bin- 
ding Handles” (SBHs) and on-going work to develop Internet Service Location 
protocols [11,12]. Some of these object identifiers include type fingerprints and 
version information as well as providing uniqueness and persistence. 

The object identifiers from the bibliographic and World Wide Web communi- 
ties [10,13,14,15] are more what we require for geographic features though these 
too are evolving towards a “live” web-object way of operating. 



3.4 Names or Addresses? 

It is important to understand that names are not the same as addresses. They 
are conceptually distinct and some schemes implement them distinctly. 

— An identifier is a name with particular persistence and uniqueness properties. 

— A naming scheme is a system for creating these names. 

— A name is resolved to an address by a resolution service.

— An object server uses the address to retrieve the named object. 

— If you have a name, you need a registry service to tell you what resolution 
services are appropriate for it. 

A practical advantage of the separation of names and addresses is that an
address can be ephemeral even if the name is permanent. By making resolution
a separate service, a name can outlast its originating organisation [16].
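
The chain from name to object can be sketched as follows. The three
dictionaries stand in for a registry, a resolution service and an object
server; the scheme name "feat", the resolver name and all addresses are
invented for the example and do not refer to real services.

```python
# Hypothetical tables standing in for three separate services.
REGISTRY = {"feat": "survey-resolver"}            # scheme -> resolution service
RESOLVERS = {
    "survey-resolver": {
        "feat:bridges/17": "http://server-a.example.org/data/b17",
    }
}
OBJECT_SERVERS = {
    "http://server-a.example.org/data/b17": {"type": "bridge", "number": 17},
}


def resolve(name: str) -> str:
    """Return the current address for a persistent name."""
    scheme = name.split(":", 1)[0]
    resolver = RESOLVERS[REGISTRY[scheme]]   # the registry picks the resolver
    return resolver[name]                    # the resolver maps name -> address


def fetch(name: str) -> dict:
    """Retrieve the named object via its (possibly ephemeral) address."""
    return OBJECT_SERVERS[resolve(name)]


name = "feat:bridges/17"
before = fetch(name)

# The object moves: only the resolver's table changes, the name does not.
new_address = "http://server-b.example.org/archive/b17"
OBJECT_SERVERS[new_address] = OBJECT_SERVERS.pop(resolve(name))
RESOLVERS["survey-resolver"][name] = new_address
after = fetch(name)
```

Any reference that stored the name keeps working after the move, whereas a
reference that stored the old address would not.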



3.5 Mechanisms 

When discussing any kind of object identifier problem, many people want to get 
straight in to discussing whether their favourite implementation mechanism will 
do what is required. There are basically two mechanisms for creating unique 
identifiers and the second comes in two main types: 

1. Pseudo random generation 

2. Federated hierarchical organisations 

a) where the same grammar is used by the subsidiary authorities 

b) where each subsidiary authority defines its own subsidiary addressing
scheme

Pseudo-random generation is the surprisingly effective technique of generating
a large random number from some local source of entropy, e.g. timing of
keystrokes on a keyboard or a short analysis of network traffic coupled with 
an absolute clock value. The probability of two separately generated identifiers 
being the same can be reduced to arbitrarily low levels by ensuring that the 
number is large enough and the entropy unbiased enough [17]. Microsoft’s COM 
GUIDs use this method. 
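
The effect of identifier length on the collision probability can be estimated
with the usual birthday-bound approximation. The sketch below uses Python's
standard uuid module as a stand-in generator; it is not a description of how
COM itself generates GUIDs, and the function name collision_probability is
introduced here only for illustration.

```python
import math
import uuid


def collision_probability(n_ids: int, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n_ids * (n_ids - 1) / 2.0 ** (bits + 1))


# uuid4() carries 122 random bits drawn from the operating system's
# entropy source.
feature_id = uuid.uuid4()

p_small = collision_probability(10**9, 32)    # a billion 32-bit identifiers
p_large = collision_probability(10**9, 122)   # a billion 122-bit identifiers
```

With 32 bits a collision among a billion identifiers is practically certain;
with 122 bits the estimate falls below one in 10^18, provided the entropy
source is unbiased.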

Federated naming schemes use a unique central authority which administers 
some top-level prefixes which it distributes to a number of other authorities, 
each of which adds something sequentially to the prefix and distributes further. 

Conceptually, the identifier in a federated system gets longer and longer as
the subsidiary tree gets deeper, but in practice it is quite possible
to work within a maximum defined length so long as this is increased universally 
every few decades. The Internet Protocol (IP), Domain Name Service (DNS) 
and Ethernet card numbering systems work like this. Some naming protocols 
put an explicit depth on the tree. 
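
A federated scheme of type (a), where every authority uses the same grammar,
can be sketched as below. The class Authority and the dotted-number grammar are
assumptions made for the example; real systems such as DNS use their own
grammars and delegation procedures.

```python
class Authority:
    """Hypothetical naming authority in a federated hierarchy."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self._issued = 0

    def _next(self) -> str:
        # Each authority only has to guarantee uniqueness of the part
        # it appends; the inherited prefix guarantees the rest.
        self._issued += 1
        return f"{self.prefix}.{self._issued}"

    def delegate(self) -> "Authority":
        """Hand a fresh sub-prefix to a subsidiary authority."""
        return Authority(self._next())

    def issue(self) -> str:
        """Issue a leaf identifier directly."""
        return self._next()


root = Authority("1")
survey = root.delegate()       # receives prefix "1.1"
utility = root.delegate()      # receives prefix "1.2"
ids = [survey.issue(), survey.issue(), utility.issue()]
# ids == ["1.1.1", "1.1.2", "1.2.1"]
```

No authority needs to consult any other when issuing, yet the identifiers
cannot collide as long as each prefix is handed out only once.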

The subsidiary naming systems may be subject to separate international 
standards, e.g. the service type names of the Service Location Protocol (SLP)
are registered with the Internet Assigned Numbers Authority (IANA) [12].

The World-Wide Web architecture assumes that resource identifiers (URIs) 
are identifiable by their scheme name which determines the subsidiary naming 
scheme. This applies to all URNs and URIs (including URLs) [15]. A URN in 
absolute form consists of: 

<scheme>:<scheme-specific-encoding>

where the scheme name usually contains only lowercase letters and
digits. The scheme name identifies the naming service and, implicitly, the reso- 
lution service, which would be used to resolve the identifier [14,15]. A subset of 
schemes use a common generic syntax: 

<scheme>://<authority><path>?<query>

The familiar “http:” URL uses a DNS machine-name (and optional port
number) as the authority. Note that the query option means even the http:
scheme can be extended to arbitrary encodings using cgi scripts and “?”
parameter separators. The Handle System architecture can be defined with a
scheme name “hdl:”. Thus a particular document has the persistent name
hdl:cnri.dlib/february96-urn_implementors which (currently) resolves to
the address: http://www.dlib.org/dlib/february96/02arms.html [16].
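
The two syntactic forms can be taken apart with Python's standard
urllib.parse module, shown here as a minimal sketch. The third URL, with its
query part, is a made-up example.org address used only to show the "?"
separator.

```python
from urllib.parse import urlsplit

# Generic syntax: scheme, authority (a DNS name) and path.
parts = urlsplit("http://www.dlib.org/dlib/february96/02arms.html")
# parts.scheme -> "http"
# parts.netloc -> "www.dlib.org"
# parts.path   -> "/dlib/february96/02arms.html"

# A scheme without the "//" authority part: everything after the first
# colon is the scheme-specific encoding.
handle = "hdl:cnri.dlib/february96-urn_implementors"
scheme, _, scheme_specific = handle.partition(":")

# The query component follows the "?" separator.
query_example = urlsplit("http://www.example.org/cgi-bin/lookup?id=42")
```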

Given the existence of one universal naming system for organisations (DNS), 
there is no great reason to invent and maintain another. In the past, individual 
industries have had to organise their own hierarchical organisation naming sy- 
stems, e.g. the International Article Numbering Association (EAN) which assigns 
manufacturer identification numbers for barcodes for retail goods and much else 
besides (http://www.ean.be). 

Today the naming systems for extensions to multimedia email formats, Java 
packages, URIs and no doubt much else are “piggy-backed” onto DNS. Thus 
if an individual has a personal website http://www.sargents.demon.co.uk he
can be sure that he can create a unique name for his collection of Java utilities
by calling the package uk.co.demon.sargents.utils.
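
The piggy-backing convention amounts to reversing the labels of the DNS name.
A minimal sketch follows; the helper name reverse_domain is hypothetical, and
dropping a leading "www" label is a convention assumed here rather than a rule.

```python
def reverse_domain(hostname: str, *suffix: str) -> str:
    """Build a Java-style package name from a DNS name."""
    labels = hostname.lower().split(".")
    if labels[0] == "www":          # drop the conventional web host label
        labels = labels[1:]
    return ".".join(list(reversed(labels)) + list(suffix))


package = reverse_domain("www.sargents.demon.co.uk", "utils")
# package == "uk.co.demon.sargents.utils"
```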

It has been suggested that feature identifiers should use their geographic 
location and feature type as part of their unique identity. Unfortunately this runs 
into so many problems with different resolutions, projections, datums, accuracies 
and Feature Class Codes that this approach is not now being seriously followed. 

4 What Do We Need Identifiers for? 

Having reviewed a bewildering array of identification mechanisms, nearly all 
under active development, we can see that we really do need some clearer idea 
of what we want identifiers for and how we want to use them: 

1. When doing updates, we need to be able to determine if the supplied identi- 
fier in the update refers to a feature in the original dataset to decide whether 
to update it or to create a new one. 

2. When comparing versions, we want to be able to find the previous version 
of a feature from the current version and vice versa. 

3. When doing some work outside a GIS, we want to be able to make a reference
to specific features in a published dataset which persists for some indefinite 
time into the future. 

4. Within some defined scope (usually the “same” dataset), we want to assert 
that a relationship of a certain type exists between two features and we want 
this relationship to persist when any feature collection containing the related 
features is copied to another dataset. 

5. When we copy a feature collection in its entirety, it would be useful if there
were some simple relationship between the feature handles in the two copies, 
e.g. differing only in some prefix, so that access in both directions were quick 
and easy. 
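
The first requirement, deciding between update and creation, can be sketched
as a merge keyed by feature handle. The function apply_update, the handle
strings and the attribute names are illustrative assumptions; the sketch
ignores deletions and conflicting edits.

```python
def apply_update(dataset: dict, update: dict) -> tuple:
    """Merge an incremental update keyed by feature handle.

    Returns the handles that were updated and those newly created.
    """
    updated, created = [], []
    for handle, attributes in update.items():
        if handle in dataset:
            # Known identifier: change only the supplied attributes so
            # that attributes maintained by other parties survive.
            dataset[handle].update(attributes)
            updated.append(handle)
        else:
            # Unknown identifier: the update introduces a new feature.
            dataset[handle] = dict(attributes)
            created.append(handle)
    return updated, created


# A user's value-added copy: the "surveyed" attribute is local.
value_added = {"roads/1": {"name": "High St", "lanes": 2, "surveyed": 1998}}

# The publisher's serial update changes one feature and adds another.
increment = {"roads/1": {"lanes": 3}, "roads/2": {"name": "Mill Lane"}}
updated, created = apply_update(value_added, increment)
```

After the merge the locally added attribute is still present on roads/1, which
is the synchronisation that value-adding users need.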

We introduced feature descriptors and feature handles earlier; but how can
we use them?

— A feature descriptor is useful only for finding and then acquiring a specific
feature. It has no other purpose and, since feature descriptors come from
“outside” the system expressed in a variety of types of language, there can be no
guarantee that two different descriptors will not produce (resolve to, evaluate
to, return) the same feature, in which case the two (different) descriptors
have “equality by reference”.

— A feature handle will also produce a feature, but some scopes will also define
that only identical feature handles can return the same feature. This means
that “equality by value” implies “equality by reference”.

— There is also the intermediate case where a feature descriptor may be
defined by a global scope which provides a “re-phrasing” service such that two
different descriptors of specified types could be cast to a canonical form in
which “equality by value” does then imply “equality by reference”.
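
The three cases can be illustrated with a small sketch. The descriptor index,
the normalisation rule and the choice of the feature handle as the canonical
form are all assumptions made for the example; a real "re-phrasing" service
could define its canonical form quite differently.

```python
FEATURES = {"scopeA/7": {"name": "Richmond Bridge"}}

# Hypothetical descriptor index: several phrasings reach one feature.
DESCRIPTOR_INDEX = {
    "richmond bridge": "scopeA/7",
    "bridge no. 7": "scopeA/7",
}


def canonical(descriptor: str) -> str:
    """Hypothetical re-phrasing service: cast a descriptor to a handle."""
    normalised = " ".join(descriptor.lower().split())
    return DESCRIPTOR_INDEX[normalised]


d1, d2 = "Richmond  Bridge", "Bridge No. 7"

# The descriptors differ as values ...
same_value = d1 == d2
# ... yet evaluate to the very same feature object ...
same_reference = FEATURES[canonical(d1)] is FEATURES[canonical(d2)]
# ... and their canonical forms can be compared directly as values.
canonical_equal = canonical(d1) == canonical(d2)
```
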

What do we mean when we say a feature descriptor “returns a feature”?
There are two possibilities: 

1. we get a lump of binary data encoded in a “well-known format” which con- 
tains all the feature’s attribute data and sufficient references to a schema to 
be able to decode the names and types of the attributes, 

2. we get a feature handle which incorporates the identification of the scope in
which it is valid.

By contrast, when we ask what happens when a feature handle “returns a
feature”, we must mean that we get the binary lump.

Assuming that we used the descriptor to make a query on some third-party 
indexing service, we get a handle but we must then find the actual dataset 
repository. We require some naming scheme so that we can use the handle to 
then obtain information about the scope object itself (the dataset). 

Each scope may have its own coding function which it uses to evaluate the 
handle and to return the feature itself (in a well-known binary format).

Some scopes may publish their coding functions and allow independent access
into the dataset, e.g. a directory tree of files in well-known formats; others may
maintain their own integrity and require access using opaque handles and a
private coding function.
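
A minimal sketch of per-scope coding functions is given below. The two scope
names, the "scope/local-part" handle layout and the use of JSON as the
"well-known format" are assumptions for illustration only.

```python
import json

# Hypothetical back-ends for two scopes.
_PUBLIC_FILES = {"roads/17.json": b'{"name": "Mill Lane"}'}
_PRIVATE_ROWS = {42: {"name": "Richmond Bridge"}}


def public_scope(local: str) -> bytes:
    # Published coding function: the local part is simply a file path,
    # so clients could also reach the file independently.
    return _PUBLIC_FILES[local + ".json"]


def private_scope(local: str) -> bytes:
    # Private coding function: the local part is opaque to clients
    # (here it happens to be a hexadecimal row number).
    return json.dumps(_PRIVATE_ROWS[int(local, 16)]).encode()


SCOPES = {"files.example.org": public_scope, "db.example.org": private_scope}


def evaluate(handle: str) -> bytes:
    """Split a handle into scope and local part, then delegate."""
    scope, _, local = handle.partition("/")
    return SCOPES[scope](local)


a = json.loads(evaluate("files.example.org/roads/17"))
b = json.loads(evaluate("db.example.org/2a"))
```

The caller of evaluate needs to know only how to find the scope from the
handle; how the local part is decoded stays the scope's own business.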

4.1 Mechanisms to Declare Scopes 

The URI and RDF mechanisms described earlier (Sect. 3.5) are specifically de- 
signed to be applied to old software systems inherited from a previous age (“le- 
gacy” systems, though strictly speaking they should be called “heritage” sy- 
stems). Thus so long as an existing geographic data repository is not still being 
updated, it can be annotated by setting up a URI which contains a subsidi- 
ary naming scheme and RDF descriptions of the metadata and schemas, and a 
subsidiary URI offering a unique identifier system for the individual features. 



5 Putting It All Together 

We have clearly seen that we need some type of “repository” which manages 
multiple feature collections (datasets), which maintains scopes, which can res- 
pond to queries about schemas, which can evaluate feature handles to return 
features and which can be identified and located from part of a feature handle’s 
observable value. We need such a repository for each “project” on which several 
people are working; incorporating several “lines of effort” at the same time. We 
have also seen that we need something (else?) which can evaluate feature de- 
scriptors and, if valid, return a feature handle. This latter service could make 
use of RDF and other metadata services and protocols. 

It is suggested that the repository manager be identified with a URN and 
that feature handles also be legal URNs but where the initial part of the scheme- 
specific encoding is the URN of its manager. 
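
One way to read this suggestion is sketched below. The scheme name "gis", the
DNS-based manager names and the "/" separator are hypothetical choices; the
text above fixes only that the manager's URN forms the leading part of each
handle.

```python
def make_handle(manager_urn: str, local_id: str) -> str:
    """Compose a feature handle whose leading part is its manager's URN."""
    return f"{manager_urn}/{local_id}"


def manager_of(handle: str) -> str:
    """Recover the repository manager from a handle's observable value."""
    return handle.rsplit("/", 1)[0]


def rebase(handle: str, old_manager: str, new_manager: str) -> str:
    """Map a handle in one copy of a collection to its twin in another."""
    if not handle.startswith(old_manager + "/"):
        raise ValueError("handle does not belong to " + old_manager)
    return new_manager + handle[len(old_manager):]


# Hypothetical manager URNs for a repository and a full copy of it.
source = "gis:survey.example.org/roads-1999"
mirror = "gis:mirror.example.net/roads-1999"

h = make_handle(source, "f000123")
twin = rebase(h, source, mirror)
```

Because the two copies differ only in the prefix, moving between them in either
direction is a simple string operation, as asked for in requirement 5 of
Sect. 4.
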

It is not necessary that the repository manager actually be “on line” at any
time: the important characteristics are solely the persistence and uniqueness of
the identifiers. If desired, these could be maintained by an entirely manual
process consisting of paper forms and authorised signatures as currently used in
many file-and-directory-based GIS archives. (The probability that an
organisation will want to participate and yet refuse to register with DNS is
assumed to be negligible.)

There are places for randomly generated identifiers, e.g. when generating new 
feature handles in disconnected remote sites, or generating short locally unique 
strings for storage as feature attributes [17]. However, since we need some kind 
of federated system for relating repositories which are going to exchange data 
anyway, it makes sense to use that for the primary architecture. We should also 
remember that not every feature necessarily needs to be issued an identifier, 
especially in inherited systems, and that identifiers do not necessarily need to 
be held “in” the dataset as attributes on the features. 

The types of relationships, e.g. version relationships, between the different 
feature collections making up a collectively-managed “scope” could usefully be 
partially standardised [7] to encourage a software component market. The ty- 
pes of relationships which already exist in ad hoc, manually managed systems 
are varieties of “source” data related to “working” datasets and eventually “pu- 
blished” data. Those GISs which provide version services have better defined 
semantics for the narrower domain of version control. 

6 Conclusions 

Doing a decent job of the “simple” matter of proposing standard ways of
constructing feature identifiers thus turns out to involve interrelated aspects
of dynamic schema discovery, metadata granularity, formal version control
semantics, distributed/replicated unique naming systems and a lot more besides.
Despite the temptation to throw up our hands in horror, there does seem hope that
very tightly-defined and narrowly focused feature handles may yet provide some
usable functionality which is worth the effort of implementation.

Abstract. Achieving interoperability between geospatial applications
is a challenging research issue that attracts the attention of a growing 
number of scientists. By interoperability we mean that users of geospatial 
information systems can share their information in a distributed and
heterogeneous environment. In such a distributed environment establishing
relationships between geospatial objects requires a mechanism to provide
a persistent and globally unique object identifier, GUOID. In this paper
we argue that given the reasons for the need of GUOIDs, an object is not 
required to be stored with a GUOID in local databases. Instead, a me- 
chanism to provide a GUOID is only required when objects are outside 
the address space of the local database. 



1 Problem Definition 

Object ID is an essential component in object technology for distributed
processing. The main thrust of this paper is to attempt to resolve the issue of
persistent object identifiers. This issue was recognised during the recent
OpenGIS(tm) meetings and is currently being investigated. We use the term Object
to refer to a computational instance of an abstract data type that has an
identifier and data, and provides services accessed through interfaces. The
object identifier, ID, can be either locally or globally unique, temporary or
persistent. Global uniqueness, GUOID, is used here in the unrestricted sense,
that is, an ID is unique everywhere, all the time.

An object ID as referred to in this paper does not refer to key or foreign key 
attributes in the relational database sense. Furthermore, if two objects have the
same GUOID it does not necessarily imply that they refer to the same real world 
object. It implies that the two computational instances are the same. An Object 
ID is an opaque identifier which is understood and resolved by the underlying 
system. It is more like a handle which is a variable in which the system can store 
meta information about an application and some object used by the application. 

An OID can be temporary or persistent, locally or globally unique (more about
persistent vs temporary OIDs is in section 2). The system presented here is called
the globally unique object identifier system, GUOID. A persistent GUOID has
several functions in a distributed heterogeneous environment. In general we need
a persistent object ID for:

1. Version management and comparison when an object is retrieved by client(s) 
and later updated at the source’s database. A mechanism is needed to allow 
the client(s) to be notified and have reference to the new version. 

2. Back tracking updates: sometimes, e.g., for copyright, a client would like to
know when and by whom a retrieved object was originated or updated. For
example an object X is sent to a client who updates or modifies it and stores
it as X1. Later, X1 is sent to another client who also modifies it and stores
it as X2. A reference from X2 to X is needed at any time. An even
more challenging situation is that X2 can acquire a reference to X without any
prior knowledge of X and without persistently storing any reference to it.

3. Maintain a correct reference to objects even if location is changed. In many 
cases information sources might change the address of their physical data 
storage, or even the whole organization might move to another city. In this 
situation a GUIOD schema should be independent of physical addresses. 

4. maintenance of complex objects created from primitive distributed ones. This 
is perhaps one of the ultimate goals from interoperability and distributed 
geoprocessing. In an interoperable environment we usually don’t need to 
keep a copy of the primitive building blocks of the complex objects that 
we have defined in our application. We only need to keep reference to them. 
There are several reasons to that. For instance. The application only requires 
the complex object and there is no practical use from its primitives. Another 
example is when the complex object is required for mission-critical operations 
and the latest update of its primitives is always needed. 

In this paper we present a mechanism to generate globally unique object
identifiers, GUOIDs. The mechanism maintains the autonomy of the database
management systems: it allows them to generate their own local unique object
identifiers while an object is local, and converts these into global persistent
identifiers when the object is posted to the outside world.

Section 2 reviews related activities and research efforts, classified according
to the type of OID service they provide, i.e., temporary, persistent, and/or
unique. Section 3 discusses these approaches. The GUOID system is introduced in
section 4, together with a non-exhaustive list of cases of interaction between
information sources and clients. The cases help to demonstrate the strength of
the GUOID. The paper is then concluded in section 5.



2 Temporary vs Persistent vs Globally Unique OID 

Temporary OIDs are assigned to objects which are created by a server object on
behalf of a client object and are destroyed at the end of the session, or at the end
of the originator’s life. On the other hand, OIDs which persist after the end of
a session between a client and a server are called persistent OIDs, e.g., serialized
objects in Java. Current paradigms of digital objects mostly support a short life
cycle that does not outlast the session, as will be shown in section 2.1.

Not all persistent or temporary OIDs are necessarily globally unique. Current
DBMSs provide OID services that are unique within the DBMS address space.
The address space can either span one machine, if the DBMS supports a single-user
environment, or span several intranet or internet machines, if the DBMS supports
a multi-user environment. However, these solutions are not designed to solve the
problem of OIDs when applications need to reference or retrieve objects that are
distributed over more than one DBMS. In this case, a globally unique object ID,
GUOID, is essential, and this is the focus of this paper.



2.1 Temporary OID 

Open Database Connectivity, ODBC, is a good example of a temporary OID
implementation. ODBC implements OIDs as handles and uses three types
of handles: environment handles, connection handles, and statement handles [5].
The connection and statement handles are managed under the environment
handle. The relation between the environment handle and the connection handle
is one to many. Similarly, the relation between the connection handle and the
statement handle is one to many. The environment handle is the global context
handle in ODBC. It redirects the scope of the ODBC engine to the underlying
connection and statement handles. Every application that uses ODBC starts off
with the function that allocates the environment handle and finishes with the
function that frees the environment handle.

The connection handle manages all information about the connection. A
connection handle is used to keep track of a network connection to a server, i.e.,
a session, and for all routing of function calls across the network. The statement
handle is used for all processing of SQL statements. Each statement handle is
associated with only one connection handle. When ODBC receives a function
call from an application and the call contains a statement handle, it uses the
connection handle stored within the statement handle to route the function call
to the correct DBMS.
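
The one-to-many relations among the three handle types, and the routing of a
call through the connection stored in a statement handle, can be illustrated
with a small model. This is only a sketch and not the ODBC API: the class names,
the `route` function, and the database names are hypothetical and chosen for
illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Environment:
    """Global context handle; owns any number of connection handles."""
    connections: List["Connection"] = field(default_factory=list)

    def connect(self, dbms: str) -> "Connection":
        conn = Connection(env=self, dbms=dbms)
        self.connections.append(conn)
        return conn


@dataclass
class Connection:
    """Tracks one session with one DBMS; owns any number of statements."""
    env: Environment
    dbms: str
    statements: List["Statement"] = field(default_factory=list)

    def statement(self, sql: str) -> "Statement":
        stmt = Statement(connection=self, sql=sql)
        self.statements.append(stmt)
        return stmt


@dataclass
class Statement:
    """Belongs to exactly one connection, which it stores for routing."""
    connection: Connection
    sql: str


def route(stmt: Statement) -> str:
    """Return the DBMS that should receive a call carrying this statement."""
    return stmt.connection.dbms


env = Environment()
roads = env.connect("roads_db")        # hypothetical DBMS names
parcels = env.connect("parcels_db")
q1 = roads.statement("SELECT * FROM road")
q2 = parcels.statement("SELECT * FROM parcel")
```

In this model all handles disappear when `env` goes out of scope, which mirrors
the temporary nature of the identifiers described above.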



2.2 Persistent OID 

The Common Object Request Broker Architecture, CORBA, provides an
interesting approach to persistent OIDs. The model is called the persistent
object service, POS [6]. POS provides a comprehensive interface architecture to
handle the relationship between its different components. As will be shown
later, our GUOID model has the POS at its heart. Therefore, we summarize the
POS model in what follows.

In POS, a persistent identifier, PID, is intended to allow a CORBA object
to have a persistent reference to its data, e.g., attribute values, in a database.
As shown in Fig. 1, POS has four components: persistent identifier, persistent
object, persistent object manager, and persistent data service.







Y.A. Bishr 



The PID identifies one or more locations within a database that house the
persistent data for an object. The ID is generated in a string representation. A
client can create and initialize a PID, and associate it with an object to be used
in a persistence model.

Persistent objects, POs, are objects that can store and retrieve their own
data. In other words, a CORBA object must have a PID to be able to store its
data. A PO can connect to and disconnect from a database, and store, restore,
and delete its data in a database. The persistent object manager, POM, acts as
a switchboard, routing requests from a PO to the correct data store. Because the
POM acts as a router for a client, the client is shielded from the specific data
storage and retrieval mechanisms used by a given database.

Fig. 1. The components of the POS, after Hoque 1998

The persistent data service, PDS, acts as the mediator between a persistent 
object and its associated database, once the database is located by the POM. 
Because databases vary in their functionality, the approaches used to accomplish 
PDS to database interaction are usually different. 
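
The division of labour among the four components can be sketched in a few
lines. The code below is a toy model under stated assumptions, not the CORBA
POS interfaces: the method names, the in-memory dictionary that stands in for a
database, and the `datastore:key` string form of the PID are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PID:
    """Persistent identifier: names a data store and a location in it."""
    datastore: str
    key: str

    def __str__(self) -> str:
        return f"{self.datastore}:{self.key}"


class PDS:
    """Persistent data service: mediates access to one data store."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}   # stands in for a real database

    def store(self, pid: PID, data: dict) -> None:
        self._rows[pid.key] = dict(data)

    def restore(self, pid: PID) -> dict:
        return dict(self._rows[pid.key])


class POM:
    """Persistent object manager: routes each request to the right PDS."""

    def __init__(self) -> None:
        self._services: Dict[str, PDS] = {}

    def register(self, datastore: str, pds: PDS) -> None:
        self._services[datastore] = pds

    def pds_for(self, pid: PID) -> PDS:
        return self._services[pid.datastore]


class PO:
    """Persistent object: stores and restores its own data through the POM."""

    def __init__(self, pid: PID, pom: POM, **attributes) -> None:
        self.pid, self.pom, self.attributes = pid, pom, attributes

    def store(self) -> None:
        self.pom.pds_for(self.pid).store(self.pid, self.attributes)

    def restore(self) -> None:
        self.attributes = self.pom.pds_for(self.pid).restore(self.pid)


pom = POM()
pom.register("cadastre", PDS())
parcel = PO(PID("cadastre", "42"), pom, owner="A", area=120.5)
parcel.store()
parcel.attributes = {}   # simulate losing the in-memory state
parcel.restore()         # the PID is enough to get the data back
```

The object never talks to a data store directly; it only knows its PID and the
POM, which is what shields the client from the storage mechanism.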

2.3 Globally Unique OID

The Internet community has adopted the term Uniform Resource Name, URN,
for a name that identifies a resource or unit of information independently of its
location. URNs are globally unique, persistent, and accessible over the network.
For example, uni-muenster.de is a valid URN. A Uniform Resource Locator, URL,
identifies the location of an instance of the resource identified by the URN. For
example, uni-muenster.de/index.html follows valid URL syntax [10].



Fig. 2. Syntax of the handle system. In the example handle 123.12334/50, the
naming authority (prefix) is 123.12334 and the suffix is 50.



Several proposals have sprung out of the URN initiative. The digital object
identifier, DOI, initiative, launched in October 1997 by the international Digital
Object Identifier Foundation, aims at providing a framework to manage
intellectual content, e.g., literature, and the rights which accompany that
content, such as access rights and copyright [1,2]. Khan et al. (1995) introduced
a framework for distributed digital object services. The framework provides a
schema for naming, identifying, and invoking digital objects [8].

The handle system is a system for assigning, managing, and resolving
persistent identifiers, known as "handles", for digital objects and other
resources on the Internet. Handles can be used as URNs (CNR, 1998; Sun, X., 1998).
The handle system is currently presented as an Internet-Draft to the Internet
Engineering Task Force (IETF) [7].

The Handle System includes an open set of protocols, a namespace, and an
implementation of the protocols. The protocols enable a distributed computer
system to store handles of digital resources and resolve those handles into the
information necessary to locate and access the resources. Perhaps one of the
main advantages of the handle system is that the associated information can be
changed as needed to reflect the current state of the identified resource without
changing the handle, thus allowing the name of the item to persist over changes
of location and other state information.

Every handle in the handle system is defined in two parts: its naming
authority, otherwise known as its prefix, and a unique local name under that
naming authority, otherwise known as its suffix, as shown in Fig. 2. The Handle
System protocol mandates UTF-8 [11] as the only encoding for any handles
specified in the protocol packet.
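
As a small illustration of the two-part syntax, the following sketch splits a
handle into its prefix and suffix and encodes it as UTF-8. The names
`Handle`, `parse_handle`, and `encode_handle` are hypothetical, and splitting
at the first "/" is an assumption made for this example rather than a
restatement of the protocol specification.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    prefix: str  # naming authority
    suffix: str  # unique local name under that authority


def parse_handle(text: str) -> Handle:
    """Split a handle at the first '/' into prefix and suffix (assumption)."""
    prefix, sep, suffix = text.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {text!r}")
    return Handle(prefix, suffix)


def encode_handle(handle: Handle) -> bytes:
    """Serialise a handle as UTF-8, the encoding the protocol mandates."""
    return f"{handle.prefix}/{handle.suffix}".encode("utf-8")


h = parse_handle("123.12334/50")   # the example handle of Fig. 2
```

Because the handle carries no physical address, the information it resolves to
can change without the handle itself changing.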

3 Discussion

We recognize that the temporary object ID approach is not a disadvantage. 
The two techniques, persistent and temporary IDs, serve two different purposes. 
Temporary OIDs in general are more practical and efficient when reference to 
objects is only needed during a client-server communication session, and no 
further reference is required when the session is terminated. 

The CORBA POS model provides a comprehensive protocol to locate and
retrieve objects that have a persistent ID. However, it does not guarantee that
an ID is globally unique in the unrestricted sense. The POS can guarantee only
that the PID is unique within the space of a client-server session. The handle
system, on the other hand, provides a practical protocol and naming scheme that
ensure a globally unique and persistent object ID. The main disadvantage of this
approach is that it violates the autonomy requirements of distributed processing
and enforces its schema on the underlying information systems. It also suggests
that each object is stored with its unique ID, which adds overhead on the
local database, which has to guarantee each time an object is created that its ID
is globally unique. This in fact does not allow the application to implement its
own independent unique ID schema, although in some cases applications need an
interpretable ID for other purposes.




Fig. 3. Components of the GUOID system

In the next section we present the GUOID system and attempt to tackle
the issues raised here. We first give an overview of the system and then
examine several use cases to demonstrate how the system would behave.

4 An Overview of the GUOID System

As shown in Fig. 3, the core of the GUOID system is based on the POS and
the handle system. In the GUOID system, as in the handle system, a source
address has two parts: a prefix and a suffix. Each source has one prefix
and can have several suffixes, depending on the objects to be communicated to
the outside world. The main characteristic of the GUOID system is that it
does not enforce an OID schema on the local data store. Instead, a GUOID
replaces the local OID on any outgoing object.

Communicating parties should adopt the handle system described in section 2.3
and be able to resolve its handles. This is done cooperatively between the PDS
and the local data store. As shown in Fig. 3, the POM is responsible for routing
the objects to the correct client and source. The PDS resolves the GUOID and
maintains the relationship between the prefix and the GUOID in a global registry
file. The data store resolves the suffix part of the GUOID, and maintains the
relationship between the local ID and the suffix in the local registry file. Each
source or client maintains its own global and local registry files. The registry
files maintain only a reference to the incoming objects, not the outgoing ones.
The information about the relationship between the local object ID and the
GUOID is thus stored in two registry files, as shown in Fig. 4. This mechanism of
registering the relationship between GUOID and local OID allows the source to
know who received its objects and allows the client to know who sent the object
at any given time. This is different from the handle system, which forces the
underlying local database to save the objects with their globally unique handles.
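
The two registry files can be pictured as two small lookup tables. The sketch
below is one possible reading of the description above, not the implementation
of the GUOID system: the class `Node`, its method names, and the choice to key
the global registry by suffix are assumptions made for illustration.

```python
from typing import Dict


class Node:
    """A client or source with its own global and local registry files."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.global_registry: Dict[str, str] = {}  # suffix -> source prefix
        self.local_registry: Dict[str, str] = {}   # local ID -> suffix

    def export(self, suffix: str) -> str:
        """Build the GUOID attached to an outgoing object."""
        return f"{self.prefix}/{suffix}"

    def register_incoming(self, guoid: str, local_id: str) -> None:
        """Record an incoming object without changing the local ID scheme."""
        prefix, _, suffix = guoid.partition("/")
        self.global_registry[suffix] = prefix   # kept by the PDS
        self.local_registry[local_id] = suffix  # kept by the data store

    def guoid_of(self, local_id: str) -> str:
        """Rebuild the original GUOID from the two registries."""
        suffix = self.local_registry[local_id]
        return f"{self.global_registry[suffix]}/{suffix}"


source = Node("123.12334")            # hypothetical prefixes and IDs
client = Node("456.789")
guoid = source.export("50")
client.register_incoming(guoid, local_id="parcel-7")
```

The client keeps using its own identifier, here `parcel-7`, and consults the
registries only when it needs to refer back to the source. A registry keyed by
suffix alone would confuse two sources that happen to use the same suffix; a
fuller design would have to address that.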

Information sources that want to provide the GUOID service should maintain
a local unique object ID in their local database. The local ID should not be
changed often and should persist as long as possible. If the source supports
versioning, a registry of the relationship between the ID of the original version
and those of the subsequent ones should be maintained.

In the remainder of this paper we introduce some use cases to help us
understand the system behavior. The following cases assume that there is a client
and an information source, or simply a client and a source.

Case 1: Searching for an object for the first time, e.g., data mining. The GUOID
system does not support this case. This is in fact intuitive: the case should not
be supported, since the client has no prior knowledge of the object. A GUOID
is an uninterpreted octet string, compatible with the handle system, and is
therefore not involved in the search process. The GUOID is not intended to
replace logical identifiers of objects, e.g., keys, foreign keys, and other
attributes that are meaningful in the database sense.

Fig. 4. Interaction between client and source



Case 2: Object is retrieved for the first time. As shown in Fig. 4, a request, or
a query, is sent by the client to the source. The request has a reference to
the client’s address. If the requested object is found, it is then sent to the
client, based on the address that was associated with the original request.
In the GUOID system, each outgoing object has two identifier attributes.
The first attribute has the full GUOID address of the source. The second
attribute has the prefix address of the client, which "tells" the object where
to go. The structure and the interaction of the global and the local registry
files are also shown in the figure.

Case 3: Object is retrieved and later referenced for an update or to retrieve
further information. The client has an old version of the object, or requires
extra information about it. Fig. 5 shows a flow chart of the operations involved.
The local registry file of the client has a reference to the object’s suffix.
First the local registry file is searched for the suffix. When found, the suffix
replaces the object’s local ID and the object is sent to the PDS. The PDS in turn
searches the global registry file for the prefix that corresponds to the suffix
and then sends the object to the corresponding source via the POM. At the source,
the PDS truncates the prefix and sends the object to the local data store. The
local data store updates the object. The reverse of the process is then repeated
and finally the object is stored at the client.

Fig. 5. Operations involved in object update in the GUOID system

Case 4: Object is retrieved from a database, then the original object is updated
at the source. In this case the client needs to be notified of any updates
to the retrieved objects. A broadcast mechanism can be designed such that
the source sends an update alert on the network. The alert includes a list
of the GUOIDs of the updated objects. The PDSs on the network may receive
the alert and search their global registry files against the GUOID list. If a
match is found, case 3 is repeated. Otherwise, the PDS can ignore the alert.

Case 5: Object is retrieved and then sent to another client. This is similar to
the situation mentioned in point 2 of the list in section 1. In this case
an object is retrieved from somewhere other than the original source, and the
client needs to reference the original source. Since any outgoing object has its
original GUOID assigned to it, the receiving client will always have a reference
to the original source.

Case 6: Local ID of the original object is changed. Although this is not
recommended, the same broadcast mechanism mentioned in case 4 can be
applied. The client then receives the broadcast alert and updates the global
and the local registry files.

Case 7: Complex object is created from primitive ones that are stored remotely.
The GUOID system does not require the data store to explicitly store the
object. The client may choose to store only references to the primitive objects
and supply the application with remote operations to access the remote primitives.
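
The alert matching of case 4 can be sketched as a lookup of each broadcast
GUOID in the receiver's global registry. The function name `matching_guoids`
and the representation of the registry as a mapping from suffix to source
prefix are assumptions made for this example; the paper does not fix a
registry format.

```python
from typing import Dict, Iterable, List


def matching_guoids(alert: Iterable[str],
                    global_registry: Dict[str, str]) -> List[str]:
    """Return the GUOIDs in an update alert that this PDS has registered."""
    hits = []
    for guoid in alert:
        prefix, _, suffix = guoid.partition("/")
        # A match needs both the suffix and the prefix recorded for it.
        if global_registry.get(suffix) == prefix:
            hits.append(guoid)
    return hits


registry = {"50": "123.12334", "51": "123.12334"}   # suffix -> source prefix
alert = ["123.12334/50", "123.12334/99", "999.1/51"]
stale = matching_guoids(alert, registry)
```

Each GUOID in `stale` would trigger the update procedure of case 3; an empty
result means the PDS can ignore the alert.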

5 Conclusions 

The crux of the OID problem is to ask the right questions. Instead of asking
only the classic question, "How do we assign a globally unique OID?", three
questions should be asked. The first is "Why do we need a GUOID?" We have
tried to answer this question in section 1. The second question is "Do we really
need to have our local database store its objects with a globally unique object
ID?" Sections 2 and 3 showed that we only need globally unique object IDs when
communicating an object to the outside world. Based on the answers to the first
two questions, the third question was "How do we assign IDs to objects such that
they appear to be globally unique to the outside world, and at the same time
maintain the application autonomy and appear unique to local applications?"

We have introduced the GUOID system, which is a hybrid between the
CORBA model and the handle system. The GUOID system goes one step further
and provides a mechanism to answer the third question using registry files.

Only the most basic elements of the GUOID infrastructure are described
herein. These elements constitute a minimal set of requirements and services that
must be in place. One restrictive point of the GUOID system is that it assumes an
ideal world of no system crashes, no error exceptions, etc. Further investigation
of fault tolerance and error recovery is required. Finally, the use cases are not
meant to be exhaustive; further investigation is required to cover a larger
spectrum of cases and implementation tests.

Abstract. In this paper, the authors’ main focus is presenting the
implementation architecture of Ramroop and Pascoe’s [9] conceptual model
for data integration in a Geographic Information System (GIS) environment.
The model is briefly discussed and extensions to the model are
made. These extensions are also applied to the notation Ramroop and
Pascoe [9] used to denote the entire data integration process. The model
consists of a Data Center and Data Agencies at a national level. Details
of the actual processing steps followed within the Data Center are
discussed and the general architecture design is described. The implementation
of the concept is modular, starting with the first service, called the Selector
Broker. Multiple tools are being considered for further investigation.



1 Introduction 

Over the past decade the use of Geographic Information Systems (GIS) has
increased greatly among professionals and non-professionals alike. One of the
major problems facing the development of GIS applications today is data
acquisition (Devogele et al. [3]).

Large quantities of data sets are available as hardcopy and softcopy. At the 
National level, such data sets are viewed as independent repositories which are 
maintained by the various organisations that collect the data sets. 

When integrating data, specific difficulties are encountered. Abel et al. [1]
stated that the difficulties encountered by an application developer are the
presence of differing data models (such as the network model, the relational
model, and so on); different Application Program Interfaces (APIs); and semantic
clashes. These difficulties vary from one data set to another, which hinders the
immediate realization of a solution to the integration problem. On-going research
in this area is described by Ramroop and Pascoe [9], Abel et al. [1], Devogele et
al. [3], Pascoe [7], and others.

Ramroop and Pascoe [9] presented a conceptual model for National integration
of geographic data. In their model, the entire process of integrating data sets
is modelled, starting with the user formulating a query and ending with the final
destination data, which is transformed into an acceptable format. Their model
reflects the typical National scenario where organizations are autonomous because
of their varying mandates. The need for organizations to co-operate and
inter-operate motivates the sharing of data and ultimately the development of
federated databases.

In this paper, Ramroop and Pascoe’s [9] model and notation for representing 
the overall process are discussed and extended in Sect. 3 and Sect. 4 respectively. 
The main focus in this paper is to address the architecture of such a system which 
is discussed in Sect. 5. 



2 The Virtual Data Center 

The model addresses the problem of data integration by making use of a Virtual
Data Center. Ramroop and Pascoe [9] defined a Virtual Data Center as:

The virtual processing center through which data sets from Data Agencies 
are located, transformed, and delivered to other Data Agencies who have 
requested data sets to satisfy their needs. 

Data Agencies will have one or more users and will provide the software necessary
for participating as a member of the Virtual Data Center. The Center is made
up of a number of services which are executed in a specific sequence, starting
with the user query. Each service has its own specific operations. The general
details of these services and their operators are discussed in Sect. 3.

The Virtual Data Center can be accessed by all users whose interest lies in
integrating data. Similarly, Abel et al. [1] presented the design of their view of
a Virtual GIS for distributed spatial data processing in heterogeneous
environments. Their system, however, is aimed at extensibility and scalability
through the distribution of the processing load. Their system’s architecture
comprises global frontends and several backend component systems through which
the processing load is distributed.

In this paper, the architecture of the Virtual Data Center is presented by
explaining the processes involved when integrating data sets at the National
level. Ramroop and Pascoe [9] indicated that the process of integrating data
sets involves, in no particular order:

- selecting data sets appropriate to the GIS application;

- transferring data sets into a common data format (without unnecessarily
losing any information); and

- merging data sets initially collected at various levels of detail (such as
graphic databases with varying map scales).

These processes are representative of the many tasks that need to be performed
and considered when integrating data sets. In Ramroop and Pascoe’s [9] conceptual
model, the Center’s processing features are used as a basis for further research.

3 National Conceptual Model

The Conceptual Model representing a National data integration strategy is
shown in Fig. 1. The model comprises two components: the Virtual Data
Center and the Data Agencies. Each Data Agency continues to operate separately
as it did in the past by creating, storing, and using geospatial data sets.
However, when the need for other data sets arises, the services of the Virtual
Data Center are accessed.

The model indicates the processing of geographic data sets which are stored
at the Data Agencies using various data schemas defined by the multiple GISs
available. Such data sets are usually stored as a collection of related files
that hold the geographic locations and attributes of each feature. These files
are therefore usually large, and their contents are variable when compared to
other data types (for example, customer data from banks) which are readily
manipulated and restructured for use in other applications.

3.1 Brokers

The services of the Virtual Data Center are facilitated by the use of Brokers.
Ramroop and Pascoe [9] defined a Broker as:

The agent responsible for executing specific processes associated with data 
integration within the Data Center. 

The Brokers divide the generic process of geographical data transfers into smal- 
ler, specialised tasks which are performed by different modules. There are four 
Brokers, three of which were defined by Ramroop and Pascoe [9]: 

— a Selector Broker for finding, selecting, and prioritizing data sets appropriate 
to user requirements; 

— a Planner Broker for optimising the data transformation process overall; and 

— a Transform Broker to perform the functions of the encode, decode, move 
(communicator), translate, and merge modules for combining data sets. 

A fourth, an Administrator Broker for managing legal implications is defined 
below. Combining the use of these Brokers enables Data Agencies to select, 
access, and merge multiple data sets from other Data Agencies satisfying their 
needs. 



Administrator Broker. At the National level, Data Agencies have spent large
sums of money to capture data (Peel [8], Johnson et al. [5], Dueker and Vrana [4],
Abel et al. [1], and so on). Although money is spent throughout the development
of any GIS implementation, 80% of the total cost is incurred during the initial
data capture stage (Aronoff [2]). Therefore, using data available from other Data
Agencies is a potentially cheaper alternative.

Laurini [6] commented on aspects of the administration of geographic
multidatabase systems. He indicated that consideration should be given to
encouraging the creation of an inter-organisation protocol. This includes
mechanisms to deal with problems such as:

— copyright: Who is the owner of the data? 

— access rights: not all end-users are granted permission to retrieve or to use 
any kind of data everywhere, therefore some limiting access rights must be 
defined; 

— difficulties during prototype implementation: indeed during the integration 
procedure some sites can crash then, who is responsible? 

— results property: by mixing information issued from different databases or 
sources then, who is the owner? 

— accounting problems: what is the cost of the data and who is paid? 

Such a protocol would be the cornerstone upon which the Administrator Broker
is developed to address issues of cost for data and other legal and
administrative concerns. The Administrator Broker is defined as:

The agent responsible for managing all legal transactions associated with
the sharing of data sets.

The Administrator Broker is responsible for: 

— requesting legal procedures from Data Agencies corresponding to the selected 
metadata; 

— informing the user of the legal procedures to be followed; and 

— ensuring that all legalities are done before the metadata is transferred to the 
Planner Broker. 

Typically, the Administrator Broker is responsible for ensuring that all legal
procedures (such as payments, copyright, intellectual property rights, royalties,
security, and so on) are considered before processing is allowed to continue.
Therefore, the Administrator Broker ensures that the relevant Data Agency
authorities are informed of the pending transaction with the user.

Having included the notion of the Administrator Broker, the notation
presented by Ramroop and Pascoe [9] will now be extended to include the
Administrator Broker.



4 The Notation 

Ramroop and Pascoe [9] defined a notation (Table 1) for describing the transfer
of data from one GIS to another as a sequence of transformations. The notation
defines a number of operators. Each either contributes to the transformation
of data values from a representation required by one GIS to that required by
another GIS, or changes the location where data is stored.

A generic transfer is denoted by:

$$Q \xrightarrow{\text{select}} \{\mu_i\} \xrightarrow{\text{admin}} \{\mu_a\} \xrightarrow{\text{plan}} \tau(S^*) \xrightarrow{*} \mathcal{D}$$

where

$Q \xrightarrow{\text{select}} \{\mu_i\}$ denotes the set of metadata $\{\mu_i\}$ reported back to the user by
the Selector Broker in response to the user query $Q$

$\{\mu_i\} \xrightarrow{\text{admin}} \{\mu_a\}$ denotes the selected set of metadata processed by the
Administrator Broker to ensure that all legalities are done, to produce an
approved metadata set $\{\mu_a\}$

$\{\mu_a\} \xrightarrow{\text{plan}} \tau(S^*)$ denotes a function $\tau$ representing the transformation
processes to be performed on each data set in $\{\mu_a\}$ by the Transform Broker

$\tau(S^*) \xrightarrow{*} \mathcal{D}$ denotes that data sets $S_i$ are transformed using one or more of
the decode, translate, encode, merge, and move operations into the final
destination data set(s) $\mathcal{D}$

Notation            Associated term             Definition

$\rightarrow$       A data transfer             The transfer of a data set from one representation
                                                to some other representation.
translate           A data translation          The transformation of a data set to conform to a
                                                different conceptual, implementation, or physical
                                                schema.
move                A data movement             The physical movement of a set of values from
                                                computer system x to computer system y.
$\xrightarrow{*}$   Many data transformations   One or more data translations or data movements.
$Q$                 User query                  User request specifying multiple criteria
                                                ($\alpha$, $\beta$, $\gamma$, ...).
$\mu$               Metadata                    Information associated with each data set.
select              Select operator             A process which selects and prioritizes metadata.
admin               Administrator operator      A process which addresses the legalities
                                                associated with the sharing of data.
$\tau$              Planner function            A function representing the optimal sequence of
                                                data transformations.
plan                Plan operator               An operator which selects the order and sequence
                                                of data transformations.
decode              Decode operator             An operator which transforms data values in a file
                                                format into equivalent values in memory.
encode              Encode operator             An operator which transforms memory data values
                                                into equivalent data values in a file format.
II                  Merge operator              An operator which combines two or more data sets
                                                into one data set.

Table 1. Notation used for describing data transfer at a National level
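
To make the operators of Table 1 concrete, the sketch below treats each one as
an ordinary function and a transfer as their left-to-right composition. It is
an illustration under stated assumptions, not part of the model of Ramroop and
Pascoe [9]: JSON text stands in for a GIS file format, a translation is reduced
to renaming fields, the move operator is omitted, and the helper `transfer` is
hypothetical.

```python
import json
from functools import reduce
from typing import Callable, Dict, List

Record = Dict[str, object]
DataSet = List[Record]


def decode(text: str) -> DataSet:
    """File format -> equivalent values in memory (JSON stands in here)."""
    return json.loads(text)


def encode(data: DataSet) -> str:
    """Memory values -> equivalent values in a file format."""
    return json.dumps(data, sort_keys=True)


def translate(rename: Dict[str, str]) -> Callable[[DataSet], DataSet]:
    """Build a translation that renames fields to fit a target schema."""
    def apply(data: DataSet) -> DataSet:
        return [{rename.get(k, k): v for k, v in rec.items()} for rec in data]
    return apply


def merge(*datasets: DataSet) -> DataSet:
    """Combine two or more data sets into one."""
    return [rec for data in datasets for rec in data]


def transfer(source: object, *steps: Callable) -> object:
    """Apply a sequence of transformations from left to right."""
    return reduce(lambda value, step: step(value), steps, source)


roads_a = '[{"nm": "Main St"}]'      # hypothetical source files
roads_b = '[{"name": "High St"}]'
a = transfer(roads_a, decode, translate({"nm": "name"}))
b = transfer(roads_b, decode)
out = encode(merge(a, b))
```

The order of the steps passed to `transfer` plays the role of the sequence
chosen by the plan operator.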



5 System Architecture 

To best understand the overall architecture of the entire system, the processing
steps are identified in Fig. 2. Generally, the data integration process is
initiated by the user, while the actual integration of data is done by the Virtual
Data Center. Once a query is submitted, the Brokers’ services are invoked in
sequence, each depending on the one before it, starting with the Selector Broker.

Apart from the final output of the destination data, there is another exit from
the Virtual Data Center, which occurs at the Administrator Broker. If the
Administrator Broker reports back to the user with legalities indicating that the
selected data is not accessible (for whatever reason), the user is given the
opportunity to make another selection or quit. In cases where the Selector Broker
reports back to the user with metadata indicating that some data sets are in
analogue form, further processing within the Center ceases for those data set(s).

The system architecture of the entire Data Integration System at a National 
level is divided into two infrastructures. One is associated with the storage of 
metadata at the Data Agencies, and the other is associated with the Virtual 
Data Center. 

At the National level there are ‘i’ metadata directories and ‘j’ databases. When integrating data
there are roughly seven steps. They are as follows:

1. The user sends a query with either single or multiple criteria to the Selector Broker.

2. The Selector Broker reports back to the user with the metadata that matched the query. Prior
to reporting back to the user, however, the following sub-processes are executed:

- heterogeneous directories are searched for matching metadata; and

- the resulting metadata is ranked as a percentage of the criteria.

3. The user selects the metadata corresponding to the desired data set(s) which is sent to the 
Administrator Broker. 

4. The Administrator Broker flags the respective Data Agencies containing the desired data set(s), 
requesting the legal procedure to follow with respect to acquiring copies of the data. This 
information is sent to the user. The user ensures that all legalities are done and then submits 
an acceptance code to the Administrator Broker. 

5. Once the code is accepted by the Administrator Broker, the metadata is sent to the Planner 
Broker to order the transformation processes needed to be performed on each data set. 

6. The Planner Broker orders and sends the transformation recommendations to the Transform 
Broker. 

7. The Transform Broker copies the selected data set(s) from the federated databases and applies 
the respective transformation on each data set according to the recommendations made by the 
Planner Broker. The final destination data set is then sent to the user. 



Fig. 2. Processing steps used for National Data Integration 
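The seven steps of Fig. 2 can be read as a chain of functions, one per Broker.
The following is only a minimal sketch under stated assumptions: the record
layout, the function names, the keyword-set query, and the fixed operator list
are hypothetical and are not taken from the test-bed. It shows the order in
which the Brokers hand work to one another and the two early exits described
in the text (analogue data sets, and data sets without an accepted code).

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Hypothetical metadata record held by a Data Agency."""
    dataset_id: str
    agency: str
    keywords: set
    analogue: bool = False  # analogue data sets are not processed further


def selector_broker(query, directories):
    """Steps 1-2: search every directory, rank hits by percent of criteria met."""
    hits = []
    for directory in directories:
        for record in directory:
            matched = len(query & record.keywords)
            if matched:
                hits.append((100.0 * matched / len(query), record))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


def administrator_broker(selection, acceptance_codes):
    """Steps 3-5: keep only data sets whose agency has an accepted code."""
    return [m for m in selection if acceptance_codes.get(m.agency, False)]


def planner_broker(cleared):
    """Steps 5-6: order the operators per data set (placeholder plan)."""
    return {m.dataset_id: ["decode", "encode"] for m in cleared}


def transform_broker(plan, databases):
    """Step 7: copy each data set, apply its operators in order, then merge."""
    merged = []
    for dataset_id, operators in plan.items():
        data = list(databases[dataset_id])  # work on a copy of the source data
        for op in operators:
            data = [f"{op}({value})" for value in data]
        merged.extend(data)
    return merged


def integrate(query, directories, databases, acceptance_codes):
    """Run the whole chain; None means the user must reselect or quit."""
    ranked = selector_broker(query, directories)
    selection = [m for _, m in ranked if not m.analogue]
    cleared = administrator_broker(selection, acceptance_codes)
    if not cleared:
        return None
    return transform_broker(planner_broker(cleared), databases)
```

In this sketch the user's selection in step 3 is simplified to "everything
that is not analogue"; an interactive front end would let the user choose
among the ranked results instead.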



5.1 Infrastructure of the Data Agency 

Each Data Agency is responsible for the storage and maintenance of its own
metadata sets. For the purposes of this research, heterogeneous databases and
directory systems are used to store metadata. A variety of metadata standards
are being defined. Examples are the Australia New Zealand Land Information
Council (ANZLIC) Metadata XML/SGML standard¹, which provides guidelines for
the content of geographical metadata, and the Content Standard for Digital
Geospatial Metadata (CSDGM)², developed by the Federal Geographic Data
Committee³. To implement the Selector Broker, the ANZLIC standard is used as
the basis for defining and storing metadata.

¹ http://www.environment.gov.au/database/metadata/anzmeta/
² http://www.fgdc.gov/Metadata/ContStan.html
³ http://www.fgdc.gov/index.html

The ANZLIC metadata standard is used in Australia and New Zealand. It is a
working standard and, to avoid the issues related to other standards, it is
assumed to be more than sufficient for this research and to fulfil the
requirements of the services of all four brokers. Further assumptions for each
Data Agency are that it would follow the ANZLIC standard to store metadata;
have fast and reliable Internet access; grant permission to run the
infrastructural software; and be committed to the overall Data Integration
concept embodied by the National Data Center.

5.2 Infrastructure of the Data Center 

The Virtual Data Center is owned jointly by all Data Agencies, which are
interconnected via the Internet. The Center consists of the infrastructural
software installed at each Data Agency and the protocol software needed to
interconnect these agencies. The Center’s software is stored and managed by
designated Data Agencies. The Center is platform independent and is made
accessible through an applet embedded within an HTML (Hypertext Markup
Language) document. When a user accesses the Center, a copy of the applet is
loaded into the memory of the user’s computer, where the processing is done.



6 Test-Bed Implementation 

The implementation of the Center is being researched on a phased basis,
starting with the Selector Broker, since it is the entry point to the Center.
A test-bed is being developed in order to test various strategies associated
with each Broker. A number of strategies will be defined, and each will be
evaluated in terms of its effectiveness and compared with the others.

The tools being used to implement the test-bed are those commonly used by the
GIS community, for example object-oriented programming, relational databases,
object databases, and object-relational databases.

6.1 Selector Broker 

The Selector Broker is built using Java. Java Database Connectivity (JDBC)
provides the connection interface, including access via ODBC, to SQL
(Structured Query Language) databases. Java programs can therefore access data
in almost every SQL database, including Oracle, Sybase, DB2, SQL Server,
Access, FoxBase, Paradox, Postgres, SQLBase, and XDB. At present, tests are
being carried out on metadata stored in the Access, dBase, ObjectStore, and
Postgres database management systems.

Fig. 3. Tools used for selecting metadata



The Selector Broker uses metadata stored at each Data Agency. The services
of the Selector Broker include:

— searching for metadata based upon the criteria stated in the user query;

— prioritizing the metadata results as a percentage of the criteria; and

— reporting back to the user with the results.
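The second service, prioritizing results "as a percentage of the criteria",
can be illustrated as follows. This is a sketch only: it assumes that a
criterion counts as met when a metadata field equals the requested value
exactly, and the field names used in the usage note are hypothetical rather
than ANZLIC element names.

```python
def rank_metadata(criteria, records):
    """Return (percent, record_id) pairs, best match first.

    criteria: dict of field -> wanted value
    records:  dict of record_id -> dict of field -> stored value
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    ranked = []
    for record_id, fields in records.items():
        # Count the criteria that this record satisfies exactly.
        met = sum(1 for key, wanted in criteria.items()
                  if fields.get(key) == wanted)
        if met:
            ranked.append((100.0 * met / len(criteria), record_id))
    # Highest percentage first; ties are broken by record id for stability.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked
```

For a query with three criteria, a record that meets all three is reported at
100% and one that meets two of them at roughly 67%; records that meet none are
left out of the report.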

The test-bed for the Selector Broker is developed so that the user reaches the
Center through an applet, which in turn accesses the metadata stored in
relational, object, and/or object-relational databases. The tools supporting
the functions of the Selector Broker are shown in Fig. 3.

The query input to the Selector Broker is built using the metadata standards
as defined by ANZLIC. All possible selection criteria are listed on a
Graphical User Interface (GUI), so that users are informed up front of the
type of metadata stored and the criteria available for building a query.

In the infrastructural software written for the Selector Broker, access to the
ObjectStore metadata database is made possible using a CORBA ORB (Object
Request Broker). To do this, CORBA is used to describe all available services,
components, and data. Using CORBA’s IDL, the metadata repository stores the
interface specifications of each object an ORB recognizes.



7 Summary 

In this paper the concept of data integration at the National level was
addressed by extending the research of Ramroop and Pascoe [9]. Their model and
notation for a Virtual Data Center were briefly presented and extended by
introducing the Administrator Broker and its operator into the notation. The
data integration processing steps were identified, which assisted in the
design of the overall Center architecture.

The infrastructures for the Data Agencies and the Virtual Data Center were
then presented, concluding with the tools used to define a test-bed for the
Selector Broker. Further research is being done on the tools to be used to
execute the roles of the other Brokers.



Abstract. Application systems in the earth observation area can be
characterised as distributed, platform-inhomogeneous, complex, and
cost-intensive information systems. In order to manage the complexity and
performance requirements set by these application scenarios, a number of
architectural considerations have to be applied. The most important ones are
modularization towards a component architecture and interoperation within
this component model. As will be described in this paper, both are mandatory
for achieving a high degree of reusability and extensibility at the component
level as well as for supporting the necessary scalability properties. In this
paper we refer to the state of the art in earth observation application
systems as well as to a prototype system that reflects to a high degree the
above-mentioned system characteristics.

Key Words: Distributed Information Systems, Earth Observation Systems,
Applications, Interoperability, Middleware, CORBA



1 Introduction 

Since the early ’70s an increasing number of satellites have orbited our
planet, making observation data related to sea, land, and atmosphere available
globally. The data is used in support of a number of applications; the best
known might be the daily weather forecast satellite maps and animations shown
on TV news programmes. Fleets of new satellites will produce about 1 TByte of
new data every day, and soon the amount of data collected within a single year
will equal the size of all data acquired in the last 25 years. With the
increase in observation platforms, the number of applications is also
increasing. For example, since the early ’90s, Europe has been exploiting its
European Remote Sensing Satellites ERS-1 and -2, e.g., for sea ice monitoring,
oil-pollution monitoring, or in support of disaster management. Earth
observation (EO) satellites represent an investment of several hundred million
ECU per spacecraft. In order to justify such investment, which is still
largely publicly funded, new emphasis has been given in recent years to the
development of ground segments, with a focus on exploiting the data streams
received from the satellites for specified applications beyond scientific use
as well. The present paper focuses on architectural considerations of
application-specific information systems for data exploitation.

* The contribution of the author is related in particular to his work during a
one year research fellowship assignment at NASA/GSFC



2 Earth Observation Application Data and Information 
Management 

EO systems distinguish between a space segment, comprising the spacecraft and
associated command and control systems for flight operations, and a ground
segment, comprising facilities for data acquisition, processing, archiving and
distribution. In the following, the focus is on the ground segment and in
particular on ground system elements that may help to relay observation
information to its specific use in end-user applications. EO Ground Systems
are defined by means of three levels:

— ‘Data Level’ (DL): large-scale infrastructures primarily operated by space
agencies and satellite operators as data providers (DP), handling data
acquisition from the EO satellite as well as standard data processing and
archiving;

— ‘Information Level’ (IL): infrastructures primarily operated by Value Adding
Companies (VAC) and scientific institutions (SI) for creating higher-level
application-specific information, e.g., through thematic processing, and used
by Value Added Resellers (VAR) for distributing such information;

— ‘End-User Level’ (UL): user access infrastructure, interface and local
infrastructure serving scientists, governmental and commercial users, and the
educational sector.

Traditionally, DL infrastructures interface directly with the user segment on
UL, serving a multitude of user requests through a single system architecture.
Search and order of standard data products are the main functions externally
accessible in such systems. Figure 1 illustrates the additional level, the IL,
which is at present in the prototyping stage. The IL constitutes an additional
layer, interfacing upstream with the DL for the provision of standard EO
products, and downstream with the UL for application-specific user access and
distribution. The IL is not one single system but comprises a multitude of
smaller systems, each serving a different user domain and each interfacing
with the DL separately. Some IL functions, in particular data ingest, may be
provided by space agencies and data providers in general. Other IL functions,
e.g., thematic, application-specific processing, may migrate towards VAC or
SI. Selected IL functions may be shared within a cluster of individual IL
systems. IL functions such as archiving of thematic products may be covered by
DP, thus reaching from DL into IL functions. End-users (EU), depending on the
service support required, may be associated with an SI, a VAC or a VAR; in
some cases an EU may even be registered with more than one information service
supplier.




Fig. 1. Information Level 



Before focusing on the IL, its functions, associated interoperation,
technology issues, and federation concepts, an estimate of EO data volumes and
access statistics helps to convey the magnitude of the problem that
present-day DL systems are facing. Estimates in this domain depend
significantly on a number of assumptions, e.g., user behaviour and market
evolution, and on the range of satellites considered, e.g., geostationary and
low-orbiting satellites, meteorological and environmental satellites.
Nevertheless, the numbers provided help, as a first approximation, to
determine a number of requirements for the design of the IL infrastructure and
IL system concepts.



2.1 Data Volumes 

EO Ground Segments handle high volumes of data, both in terms of archived
historical data and in terms of newly ingested data. Estimates of the total
volumes, considering all major LEO observation constellations world-wide,
point to more than 300 TByte of archived data in over 20 million inventory
records, varying between a few kBytes and more than 2 GByte in size. Ingest of
new EO data is estimated to be in the range of 1 TByte per day, with a
corresponding user pull above 2 TByte in the near future. The actual user pull
will depend on how successfully the data is exploited. One important enabling
element for such exploitation and global data usage is interoperable IL
systems.



Table 1. EO Information Systems - Volume Estimates - World-wide

Existing Archives (1970’s-present)
  Total Data Volume                                  > 300 TByte
  Inventory Entries                                  > 20 Million
  Single Data Item Size                              kByte - > 2 GByte
Number of Users                                      > 20000
Daily Ingest from Satellites as from 1998/99         500 GByte - > 1 TByte (a)
Daily User Pull as from 1998/99 (media & on-line)    > 2 TByte

(a) Numbers derived from NASA, ESA and EUMETSAT estimates
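A short calculation makes the relation between the figures in Table 1
explicit. It is a back-of-the-envelope sketch, not part of the original
estimates: it uses the lower bounds quoted in the table and assumes decimal
units (1 TByte = 1,000,000 MByte).

```python
ARCHIVE_TBYTE = 300.0      # lower bound of "> 300 TByte" of existing archives
INVENTORY_ENTRIES = 20e6   # lower bound of "> 20 Million" inventory entries
MBYTE_PER_TBYTE = 1e6      # decimal units assumed


def mean_item_size_mbyte():
    """Average size of one archived item implied by the two lower bounds."""
    return ARCHIVE_TBYTE * MBYTE_PER_TBYTE / INVENTORY_ENTRIES


def days_to_match_archive(ingest_tbyte_per_day):
    """Days of ingest needed to equal the existing archive volume."""
    return ARCHIVE_TBYTE / ingest_tbyte_per_day


def yearly_ingest_tbyte(ingest_tbyte_per_day, days=365):
    """Volume ingested in one year at a constant daily rate."""
    return ingest_tbyte_per_day * days
```

With these inputs the mean item size is 15 MByte. At 1 TByte per day the
existing archive volume is matched after 300 days and a full year adds 365
TByte, which agrees with the remark in the Introduction that one year of new
data will soon equal everything acquired so far. At the lower ingest figure of
500 GByte per day the same volume takes 600 days.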



2.2 Access Statistics 

As an example, based on earlier NASA estimates for US-based systems, Figure 2
provides estimates of the categories and numbers of users and of the nature of
electronic access to EO data and information, as expected for state-of-the-art
DL systems becoming available in the near future.



[Figure: Electronic Access Categories. Recoverable labels: Processing (11%),
Machine-to-Machine (6%), User analysis of data (37%), Browse-only usage (27%)]

Fig. 2. EO Information Systems - Electronic Access Categories



According to the example, the scientific community, with an estimated >12000
users globally, represents the largest part of the user population. It is also
within this community that ‘user analysis of data’ at UL is most prominent, at
37%, as the scientific usage of data often requires in-depth analysis of
particular phenomena related to a data set. User requests leading to
‘processing’ represent another significant usage type and may include
application-specific processing of data, e.g., extraction of thermal fronts
from sea surface temperature data along coast lines for the detection of
fishing grounds. Such thematic processing represents an estimated 11% of the
accesses. It is also within this type of usage that user access may result in
subsequent machine-to-machine interoperation hidden from the user, e.g.,
transfer of a particular data set from a remote archive to a thematic
processing system.

A different pattern of utilisation is found among users in the educational
community, estimated at 70000 - 200000 users. Usage is focused on browse-only
access (27%), e.g., inspection of thumbnail images without further data
processing or analysis, and subscription services (19%), where users register
for data provision related to specified locations, times, or events, e.g., the
weekly image of their hometown.



3 Information Level Federation Concepts 

3.1 Information Level 

At present, DL systems need to handle very diverse usage profiles in a single
system. However, government use and in particular commercial usage are
expected to grow significantly in the coming years as distinguishable
application domains. VAC and SI are emerging to serve these new user market
segments, and they will require automation. For example, a high percentage of
the ‘user analysis of data’ performed today by experts at UL after
consultation of the DL may need to be automated to serve these new user
communities. To do this effectively, systems will need to be capable of
handling application-specific usage in a way that is today only available to
expert users, for example a scientist who knows where to find the data and
which processing to apply in order to obtain the desired information.
Prototyping of such systems, the forerunners of IL systems, is currently in
progress in a number of selected application domains.

First results have shown that individual IL functions may be applicable to
more than one application domain and system. For example, advertisement of
available services, which is today rather limited because the services are
already known to the scientific user group, becomes a challenge across IL
systems in order to attract new governmental and commercial users to the
information services. Interoperation of participating components for
advertisement, storage, processing, workflow, and access is becoming
essential. Proper partitioning of applications into a federation of individual
IL systems is important in order to achieve the scalability and performance
required by individual user groups. That is, distribution and interoperation
of data and functions, as well as machine-to-machine interoperation, are
becoming an issue. Table 2 provides a comparison of complexity between
individual data level systems and information level systems.

Individual IL systems distinguish themselves from large-scale DL systems in
that they typically respond to the needs specific to a scientific discipline
or particular application, e.g., flood monitoring or urban planning. Unlike
satellite operators’ large-scale data systems managing multiple TBytes, these
systems are mostly concerned with a more limited amount of data, e.g.,
multiple GBytes, specific to a defined usage. The number of individual systems
is expected to be one order of magnitude higher for IL than for DL. DL systems
interoperate primarily internally, whereas IL systems may develop different
degrees of interoperation in smaller clusters. User interfaces, rather uniform
for DL, may become application specific on IL, reflecting well-identified
requirements of much smaller and well-defined user groups. This will lead to
complementary interfaces to the general purpose data search and discovery
interfaces dominant in today’s Earth Observation User Information Systems on
DL [9].



Table 2. Complexity: Data vs. Information Level

Nature of System & Service
  DATA LEVEL:        Universal (‘query from hell’)
  INFORMATION LEVEL: Specific to application

Typical Data Volume per system/service
  DATA LEVEL:        TBytes
  INFORMATION LEVEL: GBytes

Number of individual systems/services globally
  DATA LEVEL:        < 10
  INFORMATION LEVEL: > 100 (estimate for 2005)

Level of interoperation
  DATA LEVEL:        Primarily internal to individual system. Inventory
                     external
  INFORMATION LEVEL: Interoperation in a number of clusters (estimate > 20
                     globally) to varying degree depending on type of
                     federation

User Type
  DATA LEVEL:        Mostly scientific
  INFORMATION LEVEL: Science, commercial, education/wider public

Number of users per system/service
  DATA LEVEL:        > 10.000
  INFORMATION LEVEL: ±10 government and commercial
                     ±50000 education/public

Processing requirements per system/service
  DATA LEVEL:        Complex, across many scientific disciplines
  INFORMATION LEVEL: Known with application, e.g., one algorithm per system

Access/Dissemination requirements
  DATA LEVEL:        Universal
  INFORMATION LEVEL: Tailored to user type

The emergence of open, network-centric middleware, in particular the Object
Management Group’s Common Object Request Broker Architecture (CORBA)
[11,16,10], and the wide availability of advanced communication networks, in
particular the Internet, provide the essential elements of the underlying
infrastructure required for distributed IL services operating in a federation
of individual systems.



3.2 Various Federation Scenarios 

DL system developments inherently provide a high level of interoperation in a
distributed environment, as they are typically implemented as single
large-scale projects, e.g., ESA’s ENVISAT Payload Data Segment (PDS) [13], or
NASA’s Earth Observing System Data and Information System (EOSDIS) [3] with
its Data Model and Distributed Information Management (DIM). However, this
interoperation is merely internal: for external data providers and different
DL systems, interoperation either assumes the adoption of the internal
standards of a given DL system or is limited or non-existent. The Committee on
Earth Observation Satellites (CEOS) has made an attempt to achieve
interoperation of DL systems for a limited functional scope, i.e., for queries
of large-scale inventories, and has specified the Catalogue Interoperability
Protocol (CIP) [1].

In contrast to the above, IL systems are expected to be developed in a
multitude of smaller development projects. As individual projects may not have
the resources to develop and serve all required service functions internally,
a market opportunity exists for externally available, interface-compatible
system functions and services beyond catalogue services. IL systems and
services provided primarily by VAC and SI would be orchestrated in a
federation characterised by

— The level of distribution of individual information service functions, such
as thematic image processing, and their allocation under the responsibility of
different players in the service market, e.g., data providers, VAC, SI.

— The level of interoperation and reuse between different systems and
distributed information service functions, such as common advertisement
services or sharing of a common data archive between IL systems under the
responsibility of VAC or SI.

Four different degrees of federation are proposed in the following, ranging
from a ‘Non’ Federation, with distribution and interoperation only within the
DL, to a ‘Full’ Federation configuration where most functions are migrated to
the IL and distributed with a high degree of function interoperation.



‘Non’-Federation. The ‘Non’ federation (see Figure 3) represents the
state-of-the-art in EO information systems and in principle reflects today’s
DL-only configurations, i.e., without IL. All functions are typically provided
through a single distributed DL system (per world-region) appearing to the
user as a single system. DL sub-systems, e.g., for processing and archiving,
interoperate according to a single DL-internal schema, called the common DL
schema.



[Figure labels: DATA LEVEL: Archiving (of all data levels), Processing,
Access/Distribution; USER LEVEL: Users; Internet & Media]

Fig. 3. ‘Non’-Federation



This schema distinguishes a user view from a conceptual data description and a
physical data description. The user view is a logical description that
represents the end-user perception of the data. The conceptual data level
describes the data assets, illustrating relationships among classes, e.g.,
defining the data input/output relations per sub-system, and specifies the
attributes of the data. The physical data level refers to the
platform-dependent representation of the data as implemented using commercial
database management systems. The common DL schema is essential for DL-internal
interoperation. Different DL systems may offer catalogue interoperability
services based on an agreed schema subset or a derived standard, limited to
their inventory subsystems. Reuse of components within the DL is maximised.
Partitioning of the DL into application-specific information system components
is, however, strongly limited by the fact that individual components are
typically designed to handle a broad range of different information.
Therefore, DL systems as such are not easily adapted or resized into
application-specific systems as defined in the IL.

‘Processing’ Federation. The ‘Processing’ federation depicted in Figure 4
shows thematic processing as the first function migrated into a newly created
IL. Application-specific information is typically generated under VAC or SI
responsibility. Some VAC or SI may decide to share processing functions, i.e.,
an algorithm may be executable within a cluster of VAC IL systems. User access
and distribution have not changed and are still achieved through the DL
system. The IL typically builds on the DL schema. Although the ‘Processing’
federation improves the adaptability of the system to application-specific
requirements and allows a first separation of service functions, its
scalability and its reuse potential are still limited by the underlying DL.



Fig. 4. ‘Processing’ Federation



‘Processing and Access/Distribution’ Federation. The ‘Processing and
Access/Distribution’ federation (Figure 5) is an essential step towards a
federation, allowing the VAC or SI to identify and distinguish itself visibly
to the user. The IL may be defined by different overlapping IL schemata, which
for archival functions may still be those of the associated DL. A cluster of a
few IL systems may decide to operate through a single user interface and
provide common user request management, e.g., forwarding of user requests and
context information between their services. However, access functions are
expected to serve as a distinguishing feature for individual IL systems and
will only show a limited level of interoperation. This interoperation may
focus on directory and trading services for the advertisement of available IL
services across a cluster. Interfaces to GIS for the provision of non-EO
information as complementary data may be part of a VAC or SI offering.
Standard data product archival is still performed at the DL; management,
storage and archival of thematically processed data may, however, be allocated
to the IL. IL sub-system components, e.g., an ingest module, may be reused
from one IL system to another. With most functions migrated to the IL, the
level of modularization of service functions, the system scalability, its
reuse potential and its adaptability are high. Prototype systems reflecting
such a federation concept are currently being demonstrated (see Section 4).



Fig. 5. ‘Processing and Access/Distribution’ Federation



‘Full’ Federation. The ‘Full’ federation depicted in Figure 6 reduces the DL
to the long-term archive for lowest-level data products. Other archival
functions have been moved into the IL, segmented according to geographical
regions and applications. Together with interoperable interfaces for archive
services within a cluster, the number of interoperable, related processing
services is expected to increase. User access/dissemination is expected to
remain at a lower level of interoperation. Interoperation with traditional DL
systems is no longer an issue for most VAC or SI, as this is dealt with within
the IL. The lack of any common IL schema maintained outside the IL poses a
challenge to interoperation within the IL. Leading federation members, e.g.,
those providing the archival function, may provide a reference data model and
serve as an architectural model. The level of reuse, scalability,
modularization, and adaptability is comparable to that of the ‘Processing and
Access/Distribution’ federation.



Fig. 6. ‘Full’ Federation




Outlook on Federations. Depending on the success of DL systems in the
different world-regions and the pressure from application markets, the degree
of federation may in the longer run differ between regional markets. It may
settle between the ‘Processing and Access/Distribution’ federation and the
‘Full’ federation, the latter most likely remaining merely a long-term
perspective. Much will depend on the way in which new enabling technologies
are inserted into the development process. Leading IL service and system
developments may set the pace for the federation, or a co-ordinating working
group may provide recommendations for the IL.
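The four degrees of federation differ mainly in which functions sit in the DL
and which have migrated to the IL. The sketch below tabulates one reading of
the descriptions in this section. The function names are illustrative labels
introduced here, and the placement of thematic product archival in the
‘Processing’ federation is an assumption, since the text only states that
processing is the first function to migrate.

```python
from enum import Enum


class Level(Enum):
    DL = "Data Level"
    IL = "Information Level"


# Illustrative function labels, not terms defined in the paper.
FUNCTIONS = (
    "lowest_level_archive",
    "standard_product_archive",
    "thematic_product_archive",
    "thematic_processing",
    "access_distribution",
)

# For each degree of federation, the functions assumed to sit in the IL.
IL_FUNCTIONS = {
    "Non": set(),
    "Processing": {"thematic_processing"},
    "Processing and Access/Distribution": {
        "thematic_processing",
        "access_distribution",
        "thematic_product_archive",
    },
    "Full": {
        "thematic_processing",
        "access_distribution",
        "thematic_product_archive",
        "standard_product_archive",
    },
}


def allocation(federation):
    """Map every function to the level hosting it under the given federation."""
    in_il = IL_FUNCTIONS[federation]
    return {f: (Level.IL if f in in_il else Level.DL) for f in FUNCTIONS}
```

Read this way, each step along the scale moves further functions into the IL,
and only the long-term archive of lowest-level products stays in the DL under
the ‘Full’ federation.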



3.3 Technology Considerations 

In recent years, system engineering and software design for federated
information systems have undergone tremendous technological change. Modern
architectures are based upon components with well-defined interfaces and
behaviour. This enables reuse, extensibility, and scalability of systems or
system components. New technology or application requirements can be
accommodated by exchanging single components only. Furthermore, each module
can be developed by a separate company with different expertise. In order to
enable the usage of third-party software published without source code,
several component models have been defined, e.g., Microsoft’s DCOM or ActiveX
[15], IBM’s DSOM [7], OMG’s CORBA [10], etc.

With the great Internet wave that started in the early ’90s, component-based
design became even more important. Companies want to establish so-called
virtual enterprises, accessing resources of their partners and offering, e.g.,
online ordering facilities to their customers. Thus there was an increasing
need for standardized components which can interoperate with one another via
intranets or the Internet, no matter which programming language or operating
system is used. The success story of Java and its well-defined components
started. Though Java offers several ways of achieving component
interoperability, it is still one particular programming language. In order to
be extensible w.r.t. any kind of system aspect, it is not appropriate to
define the interfaces of components in a single, specific programming
language. Independence w.r.t. interface definition can, for example, easily be
achieved by means of CORBA’s Interface Definition Language (IDL). It is
independent of programming languages, but mappings exist or can be developed
for any language as needed. In addition, arbitrary components can interact
through ORBs and interfaces, and the behaviour of basic components is already
standardised by the OMG.

Though we have just detected a suitable component model, federated systems 
generally raise another requirement: Data models of each participating partner 
have to be compatible. CORBA does not offer mechanisms to resolve this is- 
sue, but federated database technology (e.g., as provided by so-called database 
middleware like IBM’s Data Joiner [5]) together with standardization endea- 
vours (e. g., OGIS) may help. The different federation scenarios presented before 
differ in their degree of distribution of functions and degree of standardisation. 
From the database point of view, these differences translate to system transpa- 
rencies for data storage and data access. ’One-stop-shopping’ is the idea that a 
user can issue an information service request to a (logically) single system and is 
freed from possible query decomposition into multiple sub-queries to distributed 
data sources, issuing those queries, and perhaps finally integrating the results 
to be presented to the user. For higher level information it may be required to 
know how this information was derived. The ability of the system to support 
such questions is the issue of pedigree or data provenance. Clearly, less control
over the data and metadata models, and an increased degree of distribution, make
the necessary transparencies more difficult to achieve.
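
The ’one-stop-shopping’ idea and the provenance requirement described above can be illustrated with a minimal sketch. It assumes a hypothetical in-process mediator; the class names, the source names and the request format are illustrative only and are not taken from the prototype or from any federated database product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Result:
    payload: dict  # the record returned by a data source
    source: str    # provenance: name of the source that supplied the record


class Mediator:
    """Logically single system that hides the distributed data sources."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[dict], List[dict]]] = {}

    def register(self, name: str, query_fn: Callable[[dict], List[dict]]) -> None:
        self._sources[name] = query_fn

    def query(self, request: dict) -> List[Result]:
        # Decompose: here every source simply receives the same sub-query.
        merged: List[Result] = []
        for name, query_fn in self._sources.items():
            for row in query_fn(request):
                # Integrate: tag each row with the source it came from.
                merged.append(Result(payload=row, source=name))
        return merged


# Two hypothetical archives standing in for distributed data sources.
def archive_a(request: dict) -> List[dict]:
    return [{"scene": "A-001", "theme": request["theme"]}]


def archive_b(request: dict) -> List[dict]:
    return [{"scene": "B-042", "theme": request["theme"]}]


mediator = Mediator()
mediator.register("archive_a", archive_a)
mediator.register("archive_b", archive_b)
results = mediator.query({"theme": "oil slicks"})
```

The user issues one request and receives one merged list; the `source` field is what would allow the system to answer how a piece of information was derived.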

4 A ’Processing and Access/Distribution’ Federation
Prototype

In order to prepare for the challenge described above, a number of prototype de- 
velopments have been initiated, e. g., the Earth Science Information Prototypes 
(ESIPs) [4] in the USA, or European projects funded by ESA and the Euro- 
pean Commission. A European prototype, the Interactive Satellite Image Server 
(ISIS) project [2,6], and its successor project RAMSES, aim at defining and va- 
lidating IL interoperation with emphasis on the distinct definition, distribution, 
and interoperation of functions such as data ingestion, processing, cataloguing, 
and user access as well as their implementation and validation on a common 
system backbone. More information on this prototype can be found in [14]. 

4.1 Distributed Functions 

Figure 7 depicts the various IL functions identified and illustrates their inter- 
connection through a common bus based on CORBA technology. The User Client 
interfaces refer to the UL, whereas the Data Ingest interfaces refer to the DL. 
This architecture has been applied by three IL system developments, each for a 
selected application: detection of fishing grounds, urban planning, and oil pol- 
lution monitoring. All components are defined through a well defined CORBA 
IDL interface. A high level of reuse of components for standard data ingest, 
image processing, and catalogue has been achieved between the systems. Alt- 
hough client and workflow functions are application specific, different modules 
on sub-component level are reusable also, e.g., image display and animation. 



[Figure: a User Client connected to the Workflow, Catalogue, Image Processing, Image Repository, and Data Ingest components]

Fig. 7. Information Level System Components



Figure 8 presents the interactions between the different functions for a given 
application, here the monitoring of oil pollution through the analysis of radar 
imagery. The UL client’s interface to IL functions is through a workflow module
which has a priori knowledge of the user, its application, and the associated 
relevant catalogue entries and processing algorithms. 



[Figure: sequence of messages between the client, image processing, and other IL components, including: Start Oil Slick Monitoring; Query GIS Data (Coasts, Rectangle); Return URL to file; Display Page2a/2b; Select an area (ROI or frame); Highlight the ROI/frame; Process command; Query for GIS data (Slicks or Frames); Execute Script at given URL; Return Result at URL]

Fig. 8. Example: IL internal Interaction between IL components



The workflow manages the client requests and the catalogue queries, e.g., it
automatically retrieves the user-relevant catalogue information at the start of
the user session. In the example it triggers the processing function once the user
has confirmed a pre-selection of a suitable data set. Such a data set is identified
by the catalogue based on a number of parameters provided by the user. It is
displayed as a vector or sub-sampled image on the user screen for selection.

The prototype makes use of the Internet Inter-ORB Protocol (IIOP) for client
access and has been implemented based on Orbix products. The OpenGIS simple
feature specifications are under consideration for interfacing external GIS for 
the provision of complementary, non-EO data. At present, this data is stored as 
vector data inside the catalogue component. 



4.2 Application of CORBA Services 

A number of commercially available CORBA services are suitable to be directly
applied in support of earth observation data and information services. An
example is the Trader service [16] which can be used as a directory function 
where different catalogues register with a description of the nature of available 
data sets, e.g., European radar imagery, ERS-2 available at the ESA-ESRIN 
catalogue. A catalogue query may thus identify and be routed to the adequate 
catalogue site without running a full query on all sites (see Figure 9). 
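
The directory function described above can be sketched as follows. This is a toy in-memory stand-in, not the CORBA Trading Object Service interface: the class `Directory`, its methods and the property names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class Directory:
    """Toy directory: catalogues register a description of their holdings."""

    def __init__(self) -> None:
        self._offers: List[Tuple[str, Dict[str, str]]] = []

    def register(self, site: str, **properties: str) -> None:
        # A catalogue site advertises the nature of its available data sets.
        self._offers.append((site, dict(properties)))

    def lookup(self, **constraints: str) -> List[str]:
        # Return only the sites whose description matches every constraint,
        # so a query is routed without contacting all catalogue sites.
        return [
            site
            for site, props in self._offers
            if all(props.get(key) == value for key, value in constraints.items())
        ]


directory = Directory()
directory.register("ESA-ESRIN", region="Europe", data="radar imagery", sensor="ERS-2")
directory.register("other-site", region="Asia", data="optical imagery", sensor="other")

targets = directory.lookup(region="Europe", sensor="ERS-2")
```

Only the sites returned by `lookup` would then receive the full catalogue query.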




Fig. 9. Directory Service and CORBA Trader Service 



Another example is the Event service [16] as illustrated in Figure 10. Well 
specified events are triggered by the ingestion of new data sets into the system 
archive. Event specifications may include the source of the data, information on
its applicability for selected applications, or a first indication of geographical
coverage. End users (EU) may register at the subscription service for a sub-set
of such events, and in conjunction with a message box will be notified by the
system when such an event occurs. The same events may be used for logging and
accounting, or may act on workflow functions to automatically trigger processing
functions on the newly ingested data.
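
The subscription mechanism described above can be sketched as a small publish/subscribe loop. The sketch is a simplified assumption and does not use the CORBA Event Service interfaces; the event fields follow the three items named in the text, while the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IngestEvent:
    source: str       # source of the newly ingested data
    application: str  # application the data is applicable to
    coverage: str     # first indication of geographical coverage


class SubscriptionService:
    def __init__(self) -> None:
        self._profiles: List[Tuple[str, Dict[str, str]]] = []
        self.message_boxes: Dict[str, List[IngestEvent]] = {}

    def subscribe(self, user: str, **profile: str) -> None:
        # An end user registers for a sub-set of ingestion events.
        self._profiles.append((user, dict(profile)))
        self.message_boxes.setdefault(user, [])

    def publish(self, event: IngestEvent) -> List[str]:
        # Notify every user whose subscription profile matches the event.
        notified = []
        for user, profile in self._profiles:
            if all(getattr(event, k, None) == v for k, v in profile.items()):
                self.message_boxes[user].append(event)
                notified.append(user)
        return notified


service = SubscriptionService()
service.subscribe("eu_1", application="oil pollution monitoring")
service.subscribe("eu_2", coverage="North Sea", source="ERS-2")

event = IngestEvent(source="ERS-2", application="oil pollution monitoring",
                    coverage="Mediterranean")
notified = service.publish(event)
```

The same `publish` call could equally feed a logging function or trigger a workflow function, as the text suggests.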

5 Conclusion and Future Issues 

The ’One-size-fits-all’ approach of today’s state-of-the-art EO information
systems, i.e. large-scale DL systems, may be adequate for the handling of standard
EO products. But it leads to overly complex solutions in view of the multitude of
emerging, very different user communities and application domains. It also risks
not meeting the adaptability requirements resulting from the future role of VAC,
SI and VARs, which demands a higher degree of independence to distinguish
their service offering. An additional system layer, the IL, may provide the
adaptability needed, balancing the complexity of individual, smaller systems against
an adequate level of interoperation among such systems in a federation.
Prototype demonstrations indicate that the evolution of the Internet, together with
the already cheaply available processing and storage power, and the emergence
of open middleware standards, provide the technological basis for federations of
IL systems yet to develop.

[Figure: an end user (EU) subscribes to particular ingestion events; the subscription service checks for users registered for particular ingestion events, creates a subscription event if an ingest event matches a subscription profile, and notifies the EU on registered ingestion events]

Fig. 10. Data Subscription Service and CORBA Event Service

A number of activities are underway worldwide with space agencies and re- 
lated organisations to advance along this line. In particular, the authors are in- 
volved in the development of a ’Processing and Access/Distribution’ prototype 
[14] and the ESA author has prepared a modelling study for better mapping
CORBA and EO IL systems [MAAT], which was initiated recently. This shall help
in performing system verification in a real case of an application and user
scenario (oil pollution monitoring) and in optimising the EO service component
model to make best use of CORBA.

Abstract. This paper describes the institutional and political context
of a spatial data infrastructure for the Hindu Kush - Himalayan region. 
It then outlines the role and present activities of the International Centre 
for Integrated Mountain Development (ICIMOD) in general and its Mo- 
untain Environment Information Service (MENRIS) Division in particu- 
lar. Some pragmatic steps to build a regional spatial data infrastructure 
that are envisaged by MENRIS are discussed. Emphasis is being put on 
metadata, standards definition and generation of regional key data sets 
at 1:250’000 scale. 

Himalayan Region



1 Background 

The Hindu Kush - Himalayan Region, which is the primary area of activity of the 
International Centre for Integrated Mountain Development (ICIMOD), covers 
an area of more than 4 Mio sq. km and is home to a population of approximately 150
Mio people. The region is made up of 8 different countries or parts thereof
(Afghanistan, Bangladesh, Bhutan, China, India, Myanmar, Nepal, Pakistan). 
Most of the area is sparsely populated and, due to the very limited agricultural 
and industrial potentials, plagued by rampant poverty. Due to the topographic 
difficulties and the remoteness from bigger population centres, the infrastructure 
is weak as well. 

There is an increasing number of trans-boundary concerns in the region: for
instance the issues of environmental degradation, such as forest depletion and 
soil erosion, but also poverty and migration, are increasingly being recognised to 
be of a regional rather than purely national domain. Moreover, there are some 
proven and more suspected highland - lowland interactions (such as the causes 
of flooding in Bangladesh, [1]) which often lead to emotional debates and po- 
litical tensions in the absence of good, publicly available data and established 
facts. Finally, there are the questions regarding the utilisation of trans-boundary
resources such as rivers, which gather increasing political sensitivity as the re- 
sources become scarcer. 

Unfortunately, the relations among some of the neighbouring countries in 
the region are not as good as they could be. Consequently, there is no effective 
institution to tackle the issues of trans-boundary resources at the political level, 
as has been established in other parts of the world (for instance the Mekong
Committee, the Rhine Forum, or the Alpine Convention, to name but a few).

Considering the political situation, it is not surprising that data on territory 
and natural resources, like maps or hydrological records, have traditionally been 
highly sensitive material in most of the countries of the region. In some of them,
large and medium scale maps are still completely off limits despite the fact that we
have been in the age of reconnaissance satellites for quite some time. The few
bits and pieces that are available are often not comparable from one country to
another, or they are extracts from global datasets like the Digital Chart of the
World or IGBP’s Global Land Cover [2] data. These datasets are typically of
a 1:1 Mio scale and have limited suitability for mountain areas with their high
variability of conditions. 

Finally, the mountain areas are marginal border zones for the bigger countries 
like China, India, Pakistan or Bangladesh, and consequently enjoy a relatively 
low priority on the national political agenda. 

To sum it up: A factual demand for an infrastructure of spatial data at
medium to small scales (1:100’000 - 1:250’000) on mountain areas in general and
the Hindu Kush - Himalayan region in particular can be taken for granted [3].
Such data would be helpful to investigate issues of trans-boundary resource use
and environmental change in a region that, inter alia, is the origin of much of the
water that sustains about 600 million people.

In the absence of a regional political body, the main users of such an infra- 
structure will be: 

— Scientific institutions like ICIMOD itself, institutions concerned with global
climate change, or Universities in the region

— International Organisations like UNEP or FAO

— Donor and development organisations (to target development interventions)

— National Government Institutions (as far as they lack more accurate data)

However, due to the mentioned security and political concerns, the conditions 
to establish this infrastructure are not favourable. 

2 The Role of ICIMOD 

Out of widespread concerns about environmental degradation and poverty in the
region, the International Centre for Integrated Mountain Development was
established in 1983 as a forum for research, scientific exchange and
documentation in and on the region as well as advisory services. This has notably
been the first and - to the author’s knowledge - so far only institution with an
explicit focus on this region. The founding fathers were the development
co-operation agencies
of Switzerland and Germany, while UNESCO and His Majesty’s Government of 
Nepal stood patron. Over the years, the support has gradually expanded to in- 
clude about a dozen different donors now. The eight countries of the region are 
Members of ICIMOD; they delegate representatives into the Board of Govern- 
ors. Through the Board, but perhaps even more through manifold contacts with 
scientific institutions and individuals, the centre is firmly anchored in the region. 
However, it has to be noted that ICIMOD’s role is a strictly scientific one, and 
it has to avoid contentious issues. For instance, ICIMOD should not publish any 
maps that depict national boundaries to avoid being dragged into one of the 
many unsettled boundary conflicts between the neighbouring countries. 

Internally, ICIMOD is structured into three thematic divisions (Mountain 
Farming Systems, Mountain Natural Resources, and Mountain Enterprise and 
Infrastructures). The thematic divisions are supported by a Documentation, In- 
formation and Training Service, the Mountain Environment Natural Resources 
Information Service (MENRIS), and the Administrative, Financial and Logisti- 
cal Service. 

ICIMOD’s conceptual approach has gradually evolved from a project- to 
a programme approach, aiming at stronger thematic and regional integration. 
This is reflected in the 4-year Regional Collaborative Programmes that
were taken up in 1995 and 1999.

3 MENRIS - Past and Present 

The Mountain Natural Resource Information Service (MENRIS) of ICIMOD was
established in 1991 with initial support from the Asian Development Bank
and UNEP-GRID. The objectives have been and continue to be to [4]:

— establish a network of nodal agencies in the regional member countries and 
serve as a resource centre to them 

— develop a database on geomorphology, soils, land use, vegetation and related 
factors through remote sensing techniques 

— develop mountain-specific applications of GIS

— facilitate the application of GIS and RS and the use of the MENRIS database
by the nodal agencies for environmental and natural resources planning, 
management and monitoring 

— improve the co-ordination of regional and project-related mapping and the 
monitoring of projects introduced in the regional member countries by va- 
rious international organisations 

In the first years, the main focus has been on installing the necessary hard- and 
software and on getting acquainted with the technology in ICIMOD itself. This has
been followed by a phase of capacity building in the region: A substantial training 
programme for professionals has been established gradually and complemented 
by short seminars for managers and policy-makers, and hard- and software has 
been supplied concurrently to partner institutions in the region. Further, a series 
of case studies has served as demonstration examples and for training purposes.
More recently, these case studies have started to evolve into support of
’real-world applications’ of GIS in partner institutions.

However, comparatively little has been achieved in terms of a comprehensive 
geographic database of the region. The case studies referred to above remained 
essentially patchwork confined to particular project areas, and to date there is 
little that MENRIS can offer in terms of homogenous regional spatial data which
cannot be offered by other institutions as well. But there are some rays of hope:
Since Nepal adopted an open data policy earlier, MENRIS has been able to
build an extensive digital database of Nepal at 1:250’000 scale. Bangladesh and
Bhutan have joined this open data policy recently and it seems that MENRIS
will be able to populate the regional spatial database with data from those
countries as well. Perhaps most significant for the region as a whole is the
very encouraging news that has come from India recently: The Government has
set up a national task force on information technology which recommended,
inter alia, that the Survey of India make the existing digital topographic data
at 1:50’000 and 1:250’000 scales available to the public at no cost and without 
copyright restrictions. However, the Defence Ministry still has to overcome its 
reservations [5]. 

Considering the more mature status of MENRIS itself, the improved
institutional capacities in the region, and the gradual easing of data restrictions, it is
felt that it is a good time now to earnestly pursue the creation of a regional spa- 
tial data infrastructure. Some of the steps that have been envisaged in MENRIS 
are discussed below, together with some issues of interoperability that will arise 
in the process. The primary objectives of these activities are: 

— to increase the availability and accessibility of relevant geographic data on
the region

— to enhance the exchange of geographic information within the region

The activities broadly fall under one of these categories:

— capacity building 

— facilitation of data exchange 

— generation of regional key datasets 

Interoperability of GIS in the Hindu Kush-Himalayan region will by and large 
mean the facilitation of data exchange, which is no small achievement, given the 
political and institutional context. Therefore the next section will focus mainly 
on this activity. 



4 Planned MENRIS Activities 1999 - 2002 

4.1 Capacity Building 



The substantial capacity-building programmes that have already been started in the
previous programme will continue under the Regional Collaborative Programme
1999-2002 (RCP-2). However, it is hoped that the already existing curricula
and training materials and the increasing availability of qualified staff in the
partner institutions will gradually ease ICIMOD’s burden in this regard. The
increasing prevalence of standard computers in government and academic offices 
will also gradually reduce the demands to supply such equipment. However, 
demands for software and special equipment (digitizers, plotters) will remain 
high. 



4.2 Facilitation of Data Exchange 

Metadata Server. A substantial amount of geographic information on the 
Himalayan region has been compiled by many institutions, development co- 
operation projects, and individual researchers. To date, most of it exists in 
analogue form, but there is also a growing number of institutions and projects 
using GIS facilities to compile their own databases. The problem is that this 
valuable information is hardly accessible, especially after the end of the respec- 
tive projects. Moreover, it can be extremely cumbersome to retrieve ancillary 
information; even such basic things as the projection system of a map are often 
unknown. 

To improve the access to existing and new geographic data, MENRIS aims
to take the lead in providing metadata services to the user community in- and
outside the region. This has also been one of the recommendations of the Space
Informatics Seminar 1996 [6] which was held in Kathmandu. 

In a first step, it is planned to document all the MENRIS data holdings. In 
a second phase, other existing data on the region shall be documented as well. 
This would not mean that ICIMOD actually holds that data; it would merely provide
a pointer to the holding agency. It goes without saying that we are again dependent
on the co-operation of our regional partner institutes and the many researchers
outside the region. Finally, it is also envisaged to make this catalogue accessible
through the Internet.

The Metadata server shall document the following types of geographic data: 

— interpreted GIS datasets on topography, geology and soils, land cover, hy- 
drography, transportation, administrative units, settlements, socio-economic 
statistics 

— raw and geo-referenced satellite images which have been acquired by MEN- 
RIS or one of the partner institutions 

— air photographs 

— possibly also paper maps on themes as above (not decided yet) 

In addition to that, the metadata server shall contain some general reference
information like national mapping systems (geodetic datum, projection, sheet
indices), satellite frame references, locations of GPS base stations, satellite
receiving stations, addresses of institutions, etc.

However, in order to be included in the metadata server, a dataset should 
fulfil certain minimal conditions: 

— Some degree of comprehensiveness in terms of area coverage and
completeness: While it will be difficult to define an exact limit as to what shall be
included, it clearly makes no sense to spend time on documenting trials and 
extremely local datasets. 

— The dataset must be accessible, at least under certain conditions that must 
be spelled out clearly. There is no point in documenting data that will not 
be released by the holding agency under any circumstances. 

The metadata server shall include a menu-driven graphical user interface
which allows simple geographical queries like: ”what data on landuse exists for
my region of interest?” or: ”is my area of interest completely covered by a
particular satellite scene?” This will be done by providing reference data
(administrative boundaries, hydrography, topography) from existing global
datasets, and the footprints of the datasets which are documented. The user will
be able to select the reference data to display, and this will be filtered
according to the current map scale. The user can then select an area of interest
and query the metadata base according to data type (as above), keyword (e.g.
geology / landcover / topography etc.), scale, date, etc. and display the metadata
of the query results.
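
The two example queries above can be sketched with rectangular footprints. This is a minimal illustration under stated assumptions: footprints are simplified to bounding boxes in geographic coordinates, the record fields are hypothetical rather than fields of a metadata standard, and the coordinates are illustrative values only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BBox:
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)

    def contains(self, other: "BBox") -> bool:
        return (self.west <= other.west and other.east <= self.east
                and self.south <= other.south and other.north <= self.north)


@dataclass(frozen=True)
class MetadataEntry:
    title: str
    data_type: str
    keyword: str
    footprint: BBox


def find_datasets(entries: List[MetadataEntry], area: BBox,
                  keyword: Optional[str] = None) -> List[MetadataEntry]:
    """What data (optionally on a given keyword) exists for the area?"""
    return [e for e in entries
            if e.footprint.intersects(area)
            and (keyword is None or e.keyword == keyword)]


def fully_covered_by(entry: MetadataEntry, area: BBox) -> bool:
    """Is the area of interest completely covered by this dataset?"""
    return entry.footprint.contains(area)


landcover = MetadataEntry("Land cover map", "GIS dataset", "landcover",
                          BBox(80.0, 26.0, 88.5, 30.5))
scene = MetadataEntry("Satellite scene", "satellite image", "imagery",
                      BBox(85.0, 27.0, 86.5, 28.5))
entries = [landcover, scene]

area = BBox(85.2, 27.5, 85.6, 27.9)
hits = find_datasets(entries, area, keyword="landcover")
covered = fully_covered_by(scene, area)
```

A real footprint would usually be a polygon rather than a rectangle, so the bounding-box test can only serve as a first filter.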

Development and Promotion of Standardised Applications. It is also 
planned to develop a number of standardised applications that can be adopted 
by institutions in the region. This would improve the comparability of geogra- 
phic information in the region. The focus will be on applications that produce 
information relating to phenomena that are a primary concern in large parts of 
the region: 

Land cover mapping and monitoring from satellite images A system to 
incorporate DTMs, agro-ecological zonations and previous land cover maps 
into a digital classification shall be developed. A primary requirement to 
ensure compatibility with other datasets is a precise rectification, including 
terrain distortions. In addition to the land cover map according to the stan- 
dards as described in 3.1, the approach should also yield some measure of 
reliability as output. 

Inventorying and monitoring of glacier lakes Glacier lakes, or actually 
the risk of their sudden outburst (GLOFs), pose a serious threat to many 
settlements and infrastructures like hydroelectricity plants in the region. The 
use of satellite images for monitoring them has been demonstrated
successfully [7]; it is now a matter of standardising the methodology and promoting
it.

Biodiversity mapping and monitoring The preservation of the region’s 
richness in biodiversity is a growing concern [8]. Yet there is not much 
known as to where exactly the ’hot spots’ are that deserve particular at- 
tention. Some methods to use satellite images to assist in the mapping of 
biodiversity have been developed elsewhere [9], but need to be adapted to 
the extreme variability of the Himalayas. 

4.3 Development of a Regional Geographic Database

Adaptation of Standards. With regard to the envisaged regional geographic 
database, standards are required to ensure the compatibility of the data. To save 
time and effort and to achieve interoperability with the ’outside world’, existing 
standards shall be adopted as far as possible and modified only where absolutely 
necessary. 

Standards are required primarily in three fields: 

— Topographic base data 

In order to produce a homogenous topographic database of the region, the 
individual elements should be clearly defined (e.g. what is considered a
highway, what is a secondary road). Such standards already exist and could be
adopted without much change. However, one has to bear in mind that the 
topographic database will be compiled from very different primary sources 
with their own inherent standards which will not always easily translate, and 
the various national mapping agencies will show little inclination to change 
their own standards in the near future. Thus the pragmatic short-term solu- 
tion will be to find out what is the smallest common denominator. However, a 
more proactive role can be taken with regard to the purely technical aspects, 
like data format, digitising accuracy, etc. 

Thus the issues (or inhibitors) of interoperability are largely semantic ones, 
as far as topographic data standards are concerned. This is not surprising,
given the fact that the region is composed of a multitude of very different cultures,
which attach very different meanings and conceptualisations to seemingly 
simple features like ’forest’ or ’river’. 

— Metadata 

The situation looks more promising with regard to metadata standards. Since
this is a relatively new topic, there is not much of a legacy to be carried along,
and a more forward-looking approach can be taken. In view of existing tools
to create metadata entries, a preliminary decision has been made to adopt
the FGDC metadata standard [10]. This would also allow interoperation
with metadata parsers that are based on the much more confined DIF
standard [11]. However, further consultations with the partner institutions will
be required, and it has to be accepted that many fields of records on existing
geographic data will remain blank, because the relevant information cannot
be retrieved any more.

— Land cover classification 

The idea here is to develop a land cover classification scheme that can be 
implemented with reasonable accuracy by digital classification of satellite 
imagery. Since the envisaged scale is 1:250’000, imagery of medium to low 
resolution, such as IRS-WiFS or NOAA AVHRR shall be used. The emphasis
will be on relatively frequent monitoring of large areas rather than detailed
land use mapping for the purpose of planning; hence the precision of the 
classes will be limited. What is important is ’upward compatibility’ to global 
land cover classifications, like those of IGBP. 

Generation of Homogenous Key Datasets. ICIMOD is also trying to establish
a homogenous regional spatial database at a scale of 1:250’000. This scale
is larger than existing global datasets which are mostly at 1:1 Mio scale or
smaller, and maps of this scale are gradually being released by some of the
regional countries. The database shall contain data which play a key role in
ICIMOD’s and other institutions’ research on mountain development and mountain
environment. Some elements are:

— administrative units (mainly as a base to visualise statistical data) 

— transportation network 

— hydrography: rivers and hydrological records (as far as available) 

— a Digital Elevation Model (DEM) - an essential item in almost all mountain- 
related GIS and RS applications: Modelling of soil erosion, slope instability, 
acceptable land use intensity, hydrological flows, but also precise processing 
of satellite images all require a DEM of suitable resolution. 

The GTOPO30 model of the USGS is currently the only model that covers 
the whole region. However, NASA/JPL are planning to acquire new InSAR 
data through a shuttle mission in 1999 and produce a global DEM of 1” 
resolution [12]. It is understood that a reduced-resolution version (3”) will 
be made available at nominal cost, which will be an enormous benefit to
ICIMOD and the region. The resolution of 3” would also allow the computation
of derivatives (slope, aspect) at sufficient accuracy for the envisaged scale of 
1:250’000. 

— Land cover 

The Indian IRS WiFS data of 188 m resolution seems to offer a good
compromise between resolution and manageability of the amount of data, and
it would fit neatly into the regional database at 1:250’000 scale. However,
having only two spectral bands limits interpretability.

First trial classifications have indicated that water bodies, snow, barren land,
sparse vegetation, forests, rainfed and irrigated agriculture can be differen- 
tiated. It is hoped that the use of auxiliary data such as agro-ecological 
zonations and DEMs can improve the classification. On the other hand, the 
combination with satellite land cover data will also help to evaluate the agro- 
ecological zonation. 

Major problems are the availability of satellite images (both technically and 
logistically), and the sheer impossibility to do any ground truth studies at 
this scale. 

— Inventory of biodiversity and protected areas

— Inventory of glacier lakes
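
The resolutions quoted above for the DEM (3”) and for the WiFS imagery (188 m) can be related to the target scale of 1:250’000 with a short calculation. The sketch below is an illustration only: it assumes a spherical Earth with a mean radius of 6371 km and measures the arc along a meridian (east-west spacing shrinks with latitude); the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius (spherical approximation)


def arcsec_to_metres(arcsec: float) -> float:
    """Length on the ground of an arc of the given size along a meridian."""
    return EARTH_RADIUS_M * math.radians(arcsec / 3600)


def ground_to_map_mm(ground_m: float, scale_denominator: int) -> float:
    """Size in millimetres that a ground distance takes up on the map."""
    return ground_m / scale_denominator * 1000


dem_cell_m = arcsec_to_metres(3)                 # roughly 93 m
dem_cell_mm = ground_to_map_mm(dem_cell_m, 250_000)
wifs_pixel_mm = ground_to_map_mm(188, 250_000)   # 188 m pixel on the map
```

Under these assumptions both the 3” DEM cell and the 188 m pixel stay below one millimetre at map scale.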

Since the staffing and funding situation of MENRIS is extremely constrained, 
a collaborative approach has to be taken. This means that most of the actual 
work to build the databases has to be done by the partner institutions in the 
region. The standards and standardised methodologies listed above are
expected to help in the process, but the role of ICIMOD will be limited to that
of a catalyst. Moreover, it has to be kept in mind that ICIMOD does not have
a mandate to enforce anything on the partner institutions - it has to rely on 
the goodwill of the partners and, to some limited extent, on the ’co-ordinating 
power of money’ it can disburse under various programmes. 

Possible applications of these data sets range from priority area selection 
for social development through policy information on land use, infrastructure, 
energy and natural resources to climate modelling. Potential users are the the- 
matic divisions within ICIMOD itself, other national and international research 
institutions, and international organisations like FAO, UNDP, UNEP, with their 
regional committees, but also NGOs and Donor organisations. Last, but not 
least, national Governments and sub-national authorities can make use of the 
data. In order to enable them to do so, it is essential that the data be available 
at nominal cost. 



5 Conclusions and Perspectives 

The demand for good, reliable and homogenous geographic data on mountain 
areas has been established clearly in scientific circles and international organi- 
sations. However, in the case of the Hindu Kush - Himalayan region there is at 
present hardly any corresponding drive from the governing political institutions. 
The prevailing security concerns have traditionally inhibited the creation
and dissemination of geographic data.

As those restrictions are gradually easing, more qualified manpower becomes 
available, and an increasing number of local and national geographic databases 
are being built in the region, it is felt that the time is right to start building a 
regional spatial data infrastructure for the benefit of scientific and international 
organisations. In the absence of large budgets and grand designs, a pragmatic 
approach shall be taken by MENRIS/ICIMOD to integrate existing pieces and 
improve their accessibility. It is envisaged that MENRIS will assume a
clearinghouse function for the region mainly by providing metadata services and
continued networking of professionals in the region. 

A limited number of generic datasets shall be created from remote sensing 
images. 

Abstract. The advent of interoperating GISs has many implications
for education. While there are certainly important issues to discuss with
regard to additions to the curriculum which address the technological
and institutional impacts of interoperating GISs, this paper focuses on 
the theme of interoperability for education. Interoperability provides a 
context for the development of shareable education materials which in 
turn allow for collaborative education in a field in which rapid techno- 
logical developments are making it difficult for individual instructors to 
stay up-to-date with both the science and the related technologies. Such 
collaborative education initiatives raise many issues, both technical and 
institutional, but a number of existing projects provide some basis for ra- 
pid developments. An international effort to create an infrastructure for 
the development and distribution of interoperable, shareable GIS educa- 
tion materials is described. 



1 Introduction 

In December 1997, the National Center for Geographic Information and Analy- 
sis (NCGIA) and the Open GIS Consortium (OGC) convened an international 
conference and workshop on Interoperating Geographic Information Systems (In- 
terop’97). Topics addressed at Interop’97 included the current state of research 
in related disciplines concerning the technical, semantic, and organizational is- 
sues of GIS interoperation; case studies of GIS interoperation; theoretical frame- 
works for interoperation; and evaluations of alternative approaches [4]. Arising 
from these discussions about GISystems interoperation was an awareness that 
interoperation might have important implications for GIS education. 

Many of the measures of the success of interoperation identified at Interop’97
are specified as measurable changes in the content of GIS courses. This suggests 
that GIS education may become an unwitting accomplice in the move to intero- 
peration. However, an alternate view may be that GIS education will become 
a fortunate beneficiary. The vision of interoperating GISs foresees ubiquitous 
GIS and the corresponding necessary pervasive spatial thinking and awareness. 
The same vision also acknowledges that success in interoperability means that 
there are many things which will no longer need to be learned. How must GIS
education change with interoperability? 

There are two perspectives to consider in this context: 1) Interoperability 
and GIS education, and 2) Interoperability for GIS education. While the first 
of these perspectives is an important growing theme for GIS educators [6], this 
paper focuses on the second perspective - interoperability for GIS education. 
The motivation for this interest comes from a recognition that GIS educators 
in the private and public sectors are faced with both an opportunity and a di- 
lemma. As the GIS vendors move to open systems which can be integrated with 
many traditional operations, the use of spatial data and analysis will become 
widespread throughout business, government and education. Hence the need for 
GIScience education is expanding rapidly. However, at the same time, rapid 
changes are occurring in both GIS technology and the structure of higher educa- 
tion. These shifting foundations make it impossible for individual GIS educators 
to stay on the leading technological edge where their students need them to be. 
Collaboration in education is now essential.

Given the urgency of these issues and the need to begin considering the 
education community’s response, an international workshop on Interoperability 
for GIScience Education (IGE’98) was organized soon after these issues were first 
discussed at Interop’97. This workshop was held in Soesterberg, The Netherlands 
on May 18-20, 1998 [8]. This paper examines the issues raised at this meeting and 
outlines various existing and new activities in the context of these discussions. 

2 The Opportunity 

GI and its associated technologies are migrating outward from the specialist ni- 
che markets in which they have been embedded over the last 20 years [1]. This 
means that a greater number of individuals are going to need to work with the 
technology in their everyday lives. Eventually this interfacing will be seamless, 
as users are able to perform high-tech spatial tasks via intuitive interfaces.
However, that point lies some time in the future. In the meantime the education
community will need to provide a broad based education strategy to deal with 
this growth in demand. 

Many educators in both the public and private sector are already responding 
to this challenge in their own individual ways by providing: 

— Web resources such as the Virtual Geography Department and the NCGIA
Core Curricula.

— Flexible education programs such as those provided by distance learning 
(e.g. UNIGIS).

— Virtual learning centers such as the Western Governors’ University and
ESRI’s Virtual Campus.

3 The Dilemma 

However, all of this is being done against a background where: 

— A significant percentage of GI knowledge, particularly as it relates to the
technology, becomes outdated within less than 6 months. 

— New GI products, services and ideas are appearing at a rate beyond any one 
individual’s ability to keep track. 

— It is impossible for an individual educator to stay at the technological leading 
edge in their field and to keep their learning materials up-to-date. 

— The model of higher learning is changing from a traditional, one-time-through 
university education experience to a flexible lifelong learning environment. 

— Mature, busy students are demanding effective and efficient learning oppor- 
tunities. 

— Many students are no longer satisfied with the talk and chalk approach to 
university education. 

— Professionally designed education products now compete against traditional 
one-off materials. 

— Central support for traditional education institutions is shrinking while
for-profit education institutions are beginning to compete for the growing
number of mature students.

— An increase in demand for just-in-time education is apparent in both acade- 
mic and industry settings. 

— Concern is increasing over how the quality of educational GI programs can
be maintained in light of decreasing budgets and rising student demand. 

The aim of the Soesterberg meeting was to explore how the GI community 
can work together to develop an Interoperable or Open environment in which 
educators can exchange resources and add value to these resources for use in 
their own unique educational settings while at the same time retaining
intellectual (and commercial) copyright. Can such an enterprise provide a framework
for collaborative education which allows GIS educators to stay on the leading 
edge of both the technology and the changes happening in higher education? 
Both technical issues, such as metadata, data formats and technology, and edu- 
cational/institutional issues related to collaborative education and sharing of 
resources need to be considered. 

4 What Does Interoperable Education Mean? 

While the use of the term interoperability within the context of education may 
be misleading, we have found that it is a useful shorthand for describing the need 
for creating materials which are shareable and can have multiple uses in various 
contexts from traditional instructor-led teaching to independent self-directed 
learning. Certainly, in most cases, materials will not need to be interoperable
in a functional sense, but the idea of being able to assemble diverse materials 
quickly for use in a specific education context does reflect at least one of the 
objectives of open architectures. In order to make materials interoperable in 
this sense, some of the development processes of the Open GIS Consortium may
be useful. We need to have some understanding of the primary components or 
building blocks of educational events and we need to have some common ways
of describing their properties. 

Educational interoperability will not come easily. There are a number of 
problems with the concept of education interoperability which are common to 
education in general and others which are specific to GIS. 

4.1 The Technological Basis of GIS 

Since GIScience is based on a continually evolving technology, technical foun- 
dations and even some concepts change rapidly making it difficult to justify 
investing too many resources in the development of shareable education materi- 
als. How can we separate education about the technology from education about 
the concepts so that elements which do not change do not need to be constantly 
revised? Can we achieve this by breaking education materials into several
smaller components? How small do these components need to be (i.e. what level of
granularity is needed)? Additionally, since technology is central, open concepts 
are relevant for both our education materials and our technology and data. Is 
there some overlap between open education systems and open GIS technology 
of which we can take advantage? 

4.2 Problems of Localization and Generalization in GIS 

In the GIS education context, both concepts and data need to be localized in 
order to address: 

— Differences imposed by local and federal institutions and regulations. 

— Differing national data models, formats and standards, including semantic 
variations in classification systems. 

— Cultural differences between both different geographic regions and different
disciplines. 

— Language variations both within and between language groups (e.g. South
American Spanish versus European Spanish).

This leads to questions about what can and should be localized and what not. 
As well, many concepts and general topics such as geocoding and street networks 
are not easily generalized across various geographic regions. Is there some way 
to separate the general from the specific when preparing education materials for 
general use? Can this be achieved by separating concepts from context? Can we
identify a level of granularity which achieves this? 

4.3 The Multidisciplinary Nature of GIS 

The way in which GIS is used differs considerably across disciplines. Thus, what 
needs to be learned also varies. As well, although GIS is assumed to have almost 
universal application, there is a general lack of spatial literacy. Is it possible to 
determine the fundamental core needed and to teach it broadly and generically 
across all disciplines? 

4.4 The International Character of GIS

Given that there are only a handful of major GISystems used worldwide and 
that international standards for open systems and data exchange are currently 
being developed, the potential for materials developed by any instructor to be 
useful to colleagues around the world is quite high. As a result the sharing 
of education materials is already a well established state of affairs in the GIS 
education community [9]. This has moved the community to a critical stage 
at which it is now essential to identify and address education interoperability 
problems and issues. 

4.5 Institutional Issues for Education Generally 

While we would like to be able to share materials internationally, different incen- 
tive models for contribution create barriers to the type of international collabo- 
rative projects needed to make interoperable education work. At a minimum, the 
need for intellectual property protection and for financial return on investment 
of time varies considerably between the US and Europe. These models need to be
clearly specified so that these differences can be accounted for when planning 
and conducting collaborative projects. In addition, as education materials are 
developed by various kinds of institutions, both private and public, mechanisms 
for promoting collaboration while providing for financial transactions between 
them are needed. 

A further institutional issue relates to shifting education paradigms [2]. A 
large repository of on-line interoperable education materials provides an oppor- 
tunity to move from “just-in-case” to “just-in-time” to “just-for-you” education, 
but not all educational institutions are prepared for these kinds of delivery me- 
chanisms. 

Finally, questions of granularity are not only technological, but they also 
need to be discussed at the institutional level. What is the appropriate level of 
interoperability from the institutional perspective? Should it be at the course 
level, the unit level, the exercise level or the component level? How can several 
different institutions benefit from shared instructional enterprises? Are different 
interoperable mechanisms needed for each level? 

4.6 International Issues for Education Generally 

Since GIS is international in character, it follows that any interoperable edu- 
cation activities need to account for international differences in education in 
general. These range from the obvious problem of language differences to more 
subtle issues of different education styles. Attention will need to be given to ap- 
propriate infrastructures and components for educational materials so that they 
will suit educational needs worldwide. Can materials prepared in English simply
be translated to other languages? Are there vocabularies and/or dictionaries for 
GIS technical terms in all languages? Are there differences in how educational 
experiences should be structured in other regions? How might these differences 
be accounted for in the definition of education components? 

5 Some Solutions - Materials Development

There is an urgent need for collaborative efforts and mechanisms which will 
assist in the discovery and use of the diverse but relevant educational materials 
already available on the web. Fortunately, the concept of interoperability for GIS 
education does not exist in a vacuum. A number of major and important projects 
are currently underway which provide significant foundations for GIS education 
interoperability. This and the next section provide brief descriptions of these 
projects within the context of interoperability for GIS education. The first section 
considers collaborative projects which are now developing shareable materials. 
The next section discusses some projects which are providing infrastructures for 
collaboration. 

5.1 The Virtual Geography Department Project 

The Virtual Geography Department (VGD) Project was begun in 1995 at the 
University of Texas at Austin under the leadership of Professor Kenneth Foote
with three years of funding from the National Science Foundation (NSF). It is 
an excellent example of collaboration in the development of education materials. 
While the topics range across the full spectrum of geographical inquiry, GIS 
materials are a major component of this resource. It is freely accessible via the 
web at http://www.utexas.edu/depts/grg/virtdept/contents.html [3]. 

The VGD provides a web-based clearinghouse of learning materials. Extended
summer workshops have led to the development of a common framework
for the format and design of these materials. Stress is placed on the integration 
of curriculum through the creation and presentation of a range of teaching ma- 
terials, including on-line course syllabi, texts, exercises, fieldwork activities and 
resource materials. While existing materials are linked through this framework, 
the development of new materials is encouraged. Materials are packaged using 
a standardized cover page including an abstract, table of contents, facts of pu- 
blication and instructor’s notes. This arrangement is particularly useful for the 
sharing of short, ephemeral materials which would not otherwise be published 
or distributed outside a single university department. 

Within the spectrum of shareable materials development, the VGD sits at the 
extreme altruistic end. Contributors receive some limited recognition for their
work and an opportunity to distribute useful materials more widely, but there is 
no monetary return for their effort. In fact, most of the contributors have been 
participants in the summer workshops where they learned how to put educational 
materials on the web. Their development of materials for the clearinghouse is 
thus an exercise in implementing that new knowledge. The ability of this project 
to survive past the end of the original project funding will demonstrate whether 
no-cost services such as this can be viable over the long-term. 

5.2 The NCGIA Core Curricula

Like the original NCGIA Core Curriculum in GIS, the new on-line Core
Curriculum in GIScience (GISCC - http://www.ncgia.ucsb.edu/giscc) and the Core
Curriculum for Technical Programs (CCTP - http://www.ncgia.ucsb.edu/cctp)
currently under development will each be composed of over 50 units of materials 
organized as lecture notes, instructor’s notes and supporting materials [7]. All 
materials are freely available on the web with development supported by base 
funding of the NCGIA in the case of the GISCC and with funding from NSF for 
the CCTP. In keeping with the spirit and success of the original Core Curriculum
and to meet the same specific need in the GIS education materials market, the
new Core Curricula concentrate solely on providing fundamental course content
assistance for educators - formally as lecture materials, but adaptable for wha- 
tever instructional mode each course instructor wishes to use. Thus, as before, 
they are not comprehensive textbooks for students, nor are the materials desi- 
gned to be used as distance learning materials. Instructors are encouraged to 
pick and choose amongst the materials on offer in order to develop courses
suited specifically for their own students. Course design and materials presentation
remain the responsibility of individual instructors. 

Recognizing the need to reward contributors, the editorial procedure for the 
new GIS Core Curriculum was initially based on a journal metaphor. Each unit
was to be overseen by a section editor, reviewed by peers and revised accordingly 
before being posted to the website. Authorship is clearly indicated and the for- 
mat for citations given at the end of each unit. This procedure was put in place 
specifically to provide a strong academic incentive for contribution. Unfortuna- 
tely, the incentives of citations and refereed publication have not proven strong 
enough to move commitments to prepare units to the top of most pledged
authors’ to-do lists. It was hoped that the GISCC would be fully populated within
a year of its formal initiation, but as of June 1998, 2 years later, only 25 of the 
originally proposed 187 units were publicly posted. As a result, a new editorial 
procedure has now streamlined the process by removing the unrewarding section 
editor positions and offering additional rewards to authors in the form of
NCGIA publications. By the end of 1998, over half of the units in a revised,
shorter collection had been prepared.

At a minimum, the materials in the NCGIA’s Core Curricula will be
significant contributions to the global GIS education materials database. Since the
materials are developed and distributed on the web, each unit can be easily 
tagged with appropriate metadata once specifications are complete. In terms of 
granularity, having units based on a single classroom session allows considerable 
flexibility in the organization of topics for a course. 



5.3 The UNIPHORM Project 

The UNIPHORM Project is funded under the EU PHARE Program in
Multi-Country Distance Education and has as its objective the development of course
materials and of a service to support distance education in Open GIS for pro- 
fessionals. The partner institutions are the UNIGIS sites at Manchester/Hud- 
dersfield, Salzburg, Sopron, Bucharest and Debrecen and the PHARE Study 
Centres at Miskolc and GDOEGS, Bucharest. The remit of the project is the
development of course materials at the UNIGIS sites and their subsequent
delivery through PHARE study centers. It is yet another example of a collaborative
materials development effort. 

Since there are many partners involved in this project, materials development 
is organized around two fundamental components. First, topics are organized as 
a hierarchical tree. Each leaf or node on the tree contains materials which are 
developed as a set of PowerPoint slides. Thus flexibility exists in how individual 
slides can be organized both within a single leaf or node, or within the hierarchy 
of topics. The smallest interchangeable unit developed is a single slide and the 
related page of material provided to students. 
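
The two components described above can be sketched as a small data structure. The following Python fragment is only an illustration and not the UNIPHORM implementation: the class names `Slide` and `TopicNode`, the `walk` method and the sample topic names are all assumptions introduced here. It shows how a course sequence might be read off a topic tree whose nodes each carry a set of slides.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Slide:
    """Smallest interchangeable unit: one slide plus its student handout page."""
    title: str
    handout: str = ""


@dataclass
class TopicNode:
    """A node or leaf in the hierarchical topic tree (hypothetical structure)."""
    name: str
    slides: List[Slide] = field(default_factory=list)
    children: List["TopicNode"] = field(default_factory=list)

    def walk(self) -> Iterator[Slide]:
        """Yield slides depth-first: this node's slides, then each child's."""
        yield from self.slides
        for child in self.children:
            yield from child.walk()


def build_example_tree() -> TopicNode:
    # Topic and slide names are invented for illustration only.
    projections = TopicNode("Projections", [Slide("Why project?"), Slide("Datums")])
    formats = TopicNode("Data formats", [Slide("Raster vs vector")])
    return TopicNode("Open GIS basics", [Slide("Course overview")],
                     [projections, formats])


course = [slide.title for slide in build_example_tree().walk()]
print(course)
```

Reordering the `children` list or moving a `Slide` between nodes changes the resulting sequence without touching the slides themselves, which is the kind of flexibility the template is meant to provide.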

This template is designed to allow a high level of control amongst several 
institutions and authors and yet retain a high level of flexibility for course en- 
gineering. It imposes some restrictions in the way material has to be structured 
and presented but the payoff is substantial in terms of managing complex and 
shifting resources and in providing cheap and effective creation and delivery of 
courses. 

5.4 UNIGIS 

UNIGIS is an international network of universities which together offer a
postgraduate diploma and MSc in GIS by distance learning methods
(http://www.unigis.org). Students complete ten modules, each of which covers a substantive
GIS topic and may elect to complete a research project to qualify for the MSc 
diploma [5]. The UNIGIS program has been taught from the UK since 1991 and 
thus its developers have already experienced many of the problems associated 
here with GIS education interoperability. Some of these include: 



The need for a sustainable business model. Much of the material on the 
web is freely available and there has so far been a laudable ethos that the web is 
an arena for sharing knowledge. Visitors to the UNIGIS site, however, will find
that most of the materials are behind a password which is available only to their 
students. The reason for this is, quite simply, that UNIGIS is a business. Within 
the UNIGIS network a system of royalty fees and concept payments allows those 
sites which originate materials to receive a return for their effort while giving 
other sites access to materials which they could not themselves have generated. 
Development of a generic “web market” would also allow UNIGIS to buy in
modules from other providers which cannot be generated internally.



The instability of the web for teaching purposes. While there is a huge 
amount of material already available for teaching purposes on the web, many
problems exist. The quality of material is not guaranteed, there being no equivalent
of peer review on the web. The continuing availability of material is not guaran- 
teed - it may be that the great site upon which you’ve based your lecture may be 
taken ’off the air’ by the site owner tomorrow. The legality of linking materials 
from web sites into one’s own pages sometimes gives one pause for thought -
do many academics actually understand the copyright implications of using web
material? 



The importance of cultural differences. Although GIS is often regarded as 
a technical subject, it is of course embedded in national and linguistic contexts. 
At present the UNIGIS materials have been authored primarily in the UK and 
so non-UK sites have the task of customizing these core materials to fit their 
local circumstances. This local customization process is not a trivial task. 

6 Some Solutions - Infrastructures for Materials Development

6.1 The ESRI Virtual Campus and Knowledge Base 

ESRI’s Virtual Campus (http://campus.esri.com) is the first strong contender
in private sector on-line GIS training. As might be expected the materials
are very well designed, pleasant to use, very reasonably priced and extremely
responsive to market needs. The Campus was launched in 1997 and according
to ESRI sources attracted over 1200 students in the first nine months. Currently
the Campus offers several short, interactive on-line training courses in ArcView
GIS. Each course is designed in a similar manner and contains well-structured 
content, examples, exercises and a short multiple-choice exam. The materials 
have been widely acclaimed and several universities have used or are considering 
using them as components in traditional campus-based courses (e.g., the Vrije 
Universiteit Amsterdam included ESRI’s introduction to ArcView course in a 
recent campus course).

More important for the infrastructure of GIS education interoperability is 
ESRI’s Knowledge Base. This is a database of GIS concepts, examples, exercises, 
and test questions which can be used to build learning situations within the 
context of their Virtual Campus. Materials within the database are structured
in a uniform manner and adhere to a standardized set of component types. 
Courses can be quickly constructed out of these building blocks by determining
what generic components and sequence are needed and then using the Knowledge
Base to find and select a module for each specific element. The company plans 
to contract with third party authors who will create materials for the Knowledge 
Base. The authoring program will use a business model that will allow external 
authors to receive royalties when their materials are used in the Virtual Campus.
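
The course-building procedure described in this section (choose a sequence of generic component types, then select a module for each) can be illustrated with a short Python sketch. It does not describe ESRI's actual Knowledge Base: the `Component` class, the `assemble_course` function, the component type names and the sample entries are hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    """One building block in a hypothetical component database."""
    component_type: str  # e.g. "concept", "example", "exercise", "test"
    topic: str
    title: str


def assemble_course(template: List[str], topic: str,
                    knowledge_base: List[Component]) -> List[Optional[Component]]:
    """Fill each slot of the template with the first component of the
    requested type and topic, or None when nothing suitable exists."""
    course: List[Optional[Component]] = []
    for slot_type in template:
        match = next(
            (c for c in knowledge_base
             if c.component_type == slot_type and c.topic == topic),
            None,
        )
        course.append(match)
    return course


# Invented sample entries, for illustration only.
knowledge_base = [
    Component("concept", "buffering", "What is a buffer?"),
    Component("exercise", "buffering", "Buffer a road network"),
    Component("concept", "overlay", "Polygon overlay"),
    Component("test", "buffering", "Buffering quiz"),
]

template = ["concept", "example", "exercise", "test"]
course = assemble_course(template, "buffering", knowledge_base)
titles = [c.title if c else None for c in course]
print(titles)
```

A slot that comes back as `None` marks a gap in the database, which is exactly where a third-party author could contribute a new component.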



6.2 Instructional Management Systems (IMS) 

The Instructional Management Systems Project (http://www.imsproject.org)
represents a consortium of government, academic and commercial organizations 
who are developing a set of specifications and prototype software for facilita- 
ting the growth and viability of distributed learning on the Internet. Briefly, 
IMS seeks to provide a complete environment for the management of education
materials, learning and administration. While it provides many facilities of
general significance, the elements of the project of particular relevance to GIS 
education interoperability are their metadata specifications and mechanisms for 
authentication and commerce. 

The IMS base metadata specifications have recently been approved by IEEE. 
IMS metadata properties that describe educational content include: Discipline, 
Concept, Coverage, Type, Approach, Granularity, Structure, Interaction Quality, 
Semantic Density, Presentation, Role, Prerequisites, Educational Objectives, Le- 
vel, Difficulty, Duration. Recognizing that different disciplines will have different 
needs, the development of discipline-specific schemas and property definitions is 
encouraged. Given that the GIS education community is extremely international 
and active, the IMS project team has recognized that GIS provides an opportu- 
nity for the rapid, early development of some important demonstration activities. 
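
To make the role of such properties concrete, the following Python sketch describes a few education resources with a subset of the property names listed above and filters them by exact match. It is an assumption-laden illustration, not the IMS specification: the `ResourceMetadata` class, the `search` function, the added `title` field, the value formats and the sample records are all invented here.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ResourceMetadata:
    """A subset of the property names listed above; value formats are assumed."""
    title: str  # not one of the listed properties, added for readability
    discipline: str
    concept: str
    coverage: str
    granularity: str
    level: str
    duration_minutes: int
    prerequisites: List[str] = field(default_factory=list)


def search(catalogue: List[ResourceMetadata], **criteria: Any) -> List[ResourceMetadata]:
    """Return the records whose properties equal every given criterion."""
    return [
        record for record in catalogue
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only.
catalogue = [
    ResourceMetadata("Map projections", "GIS", "projection", "global",
                     "unit", "introductory", 50),
    ResourceMetadata("Geocoding exercise", "GIS", "geocoding", "USA",
                     "exercise", "intermediate", 90, ["Map projections"]),
    ResourceMetadata("Spatial statistics", "Statistics", "autocorrelation",
                     "global", "unit", "advanced", 50),
]

hits = search(catalogue, discipline="GIS", granularity="unit")
print([record.title for record in hits])
```

A discipline-specific schema of the kind encouraged above would add or refine fields such as `coverage` rather than replace this basic record.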

Other aspects of the IMS project which are of interest to interoperable GIS 
education include: 

— how IMS addresses the issue of intellectual property

— how lineage is recorded in the metadata descriptions 

— authentication and how rights of access and use will be managed 

— the commerce model being developed to handle financial transactions 

— how IMS may provide assurance of quality through the use of review
bodies (similar to the Michelin star system), usage records and assessment of
educational outcomes by external bodies. 

From the perspective of the GIS education community, IMS does seem to pro- 
vide some important immediate solutions for interoperability. In particular, the 
metadata schema provides a standardized means of describing and cataloguing 
the vast range of resources already available on the web. With the promised ad- 
vent of metadata search engines on the web, properly tagged HTML documents 
could conceivably be discovered anywhere on the Internet. However, there are a 
number of issues of particular concern to GIS educators which may need special 
attention: 

— There is a need to separate content from infrastructure. 

— There is an acknowledged distinction between training and education. Can
these share a common set of standards? Is all learning the same? 

— There are several layers of interoperability needed: from technological
(objects communicating) to semantic to institutional. Likewise, IMS is
middleware: technology is the foundation, and policy and institutional matters
sit above it.

— Given the need for localization in geographic information science, how should 
learning profiles or educational settings be matched to metadata? Can we
establish hierarchical schemas in metadata to address geographical or disci- 
plinary foci? 

— What is the appropriate level of granularity given the need for localization?
Can we use nested hierarchies in metadata to address this? Can small objects 
be viable? 

— IMS may provide the mechanisms needed to address the incentive problems. 
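
One way to picture the nested hierarchies raised in the questions above is to record the geographic scope of each resource as a path and to match it against the path of an educational setting. The Python sketch below is purely illustrative and is not part of IMS: the path representation, the `covers` and `most_specific` functions and the sample resources are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

Path = Tuple[str, ...]


def covers(resource_scope: Path, setting: Path) -> bool:
    """True when the resource scope is the setting itself or one of its ancestors."""
    return setting[:len(resource_scope)] == resource_scope


def most_specific(resources: Dict[str, Path], setting: Path) -> Optional[str]:
    """Pick the applicable resource with the deepest (most localized) scope."""
    applicable = [(len(scope), name) for name, scope in resources.items()
                  if covers(scope, setting)]
    return max(applicable)[1] if applicable else None


# Invented sample resources, for illustration only.
resources = {
    "Geocoding concepts": ("World",),
    "Geocoding with European address models": ("World", "Europe"),
    "Geocoding with Dutch postcodes": ("World", "Europe", "Netherlands"),
    "Geocoding with US street ranges": ("World", "North America", "USA"),
}

dutch_class = ("World", "Europe", "Netherlands")
austrian_class = ("World", "Europe", "Austria")
print(most_specific(resources, dutch_class))
print(most_specific(resources, austrian_class))
```

Under this scheme a setting with no localized material falls back to the nearest more general resource, so generic concept modules and small localized objects can coexist in one catalogue.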



6.3 The Open GIS Consortium (OGC) 

Although OpenGIS specifications are not designed to meet instructional needs in
particular, the inclusion of functional geoprocessing components in instructional 
materials points at the need to ensure that GIS educators who are developing in- 
teroperable education materials consider these new geoprocessing specifications 
during their materials development. 

Like IMS, OGC provides a proven model for the development of
community-wide specifications. Certainly an Education SIG in OGC would provide a
vehicle for discussions of geoprocessing interoperability as it applies to
education; however, the need for such domain-specific geoprocessing specifications
is not clear. On the other hand, there are some education needs which relate to
OpenGIS.
Merging interoperable educational services (from IMS) with interoperable GI 
services (from OGC) seems doable now. Interoperability in education is pretty 
much the same across domains, and GIS interoperability is not different for edu- 
cational purposes. But in order to make products appear, both sides need to 
be aware of each other, providing input to IMS metadata definitions or OGC 
topics, and helping to explore and define business models. 

7 What’s Missing? What Do We Need? 

While each of these projects presents effective responses to various education 
problems, when considered across the spectrum of issues identified earlier, there 
remain unresolved problems. Still needed are: 

— A clear picture of the various incentive models which can be used to encou- 
rage participation in collaborative projects supporting interoperability for 
education and which will lead to long-term sustainability of such activities. 

— Identification of the appropriate granularity and specification of the range of 
educational component types (e.g. exams, units, concept modules, exercises, 
applets).

— Models for the development of shareable GIS educational materials which 
address the issues of generalization, localization and technological change. 

— A fully functioning prototype of a database of education components which
can be combined into various types of educational “events”.

— Mechanisms for structuring shareable education components. 

— The creation of a huge volume of relevant metadata associated with educa- 
tion materials already on-line. 

— Attention to the critical but still unexplored issue of language differences in
the context of shareable GIS education materials.

— Mechanisms for quality control.

— Digital “wizards” which would assist in the creation of metadata and the 
construction of education “events”. 

— A change in the education system which will support new education
paradigms involving collaborative activities by educators.

— Dissemination of information about active projects and mechanisms which 
support collaborative and interoperable education. 

— Information included in metadata which describe the context of shareable 
education objects, i.e. how is it used, what is the audience, what do people 
think who have used it? 

8 Next Steps 

The technology to allow the delivery of interoperable educational objects via the 
web will become available very soon. Furthermore the support for on-line learning 
and collaborative teaching from governments and significant higher education 
bodies is such that it is inevitable that such programs will expand dramatically 
in the short-term. Thus we should work to ensure that GIScience educators are 
as well placed as possible to take advantage of the opportunities, and avoid the 
pitfalls, which the shift towards on-line and collaborative teaching will gene- 
rate. Three areas for concerted effort by the GI education community have been 
identified for immediate action. 



8.1 Metadata for GIScience Education Materials 

A major concern must be to ensure that the interests of GIScience are strongly 
represented in the super-disciplinary projects which are presently laying down 
the ground rules for collaborative, on-line education interoperability. Just as the
OGC is presently acting as a lobbying and technical development group to
ensure that ’geography’ is properly accommodated within emerging distributed
computer environments, so too GIScience educators need to lobby to ensure 
that GIScience is properly represented in emerging on-line and collaborative 
educational initiatives. In particular, an international task force of GIS educators 
will soon begin work with the IMS project team to examine and refine where 
necessary the generic IMS metadata specifications. These extended specifications 
will be reviewed by the international community and, it is hoped, will lead to the 
preparation of structured metadata for large quantities of education materials 
already available on-line. 



8.2 A Prototype Knowledge Base for GIScience Education 
Materials 



Clearly, it is desirable to generate a prototype GIScience knowledge base as
quickly as possible, both to learn what workloads are involved in creating such
a structure and to create an exemplar which can be used at conferences and
workshops to generate wider awareness. The potential of the work already done
by the ESRI knowledge base project is impressive; thus, rather than create an
entirely new structure, efforts are now underway to test whether this structure,
designed for proprietary training materials, can be adapted for use with generic
education materials “outside the ESRI firewall”.

8.3 Incentive Models for Interoperable GIS Education 

The move towards on-line and collaborative GIScience education should not be 
viewed purely, or even primarily, as a technical issue. For such GIScience educa- 
tion to be successful, academics will need to feel that it is worthwhile to spend 
their time creating shareable education objects. Higher Education institutions 
will need to be able to see how revenue might be generated from collaborative 
teaching initiatives. In other words, the incentives which might make the
development and use of shareable GI education resources take off need to be explored.
A third task, therefore, is to research the motivations which lie behind current 
on-line resources initiatives and to try to anticipate what incentives might be 
necessary in future to encourage GIScience interoperable education projects to 
develop strongly. 

9 Conclusion 

Interoperability, understood as making GIS education resources shareable and
easily accessible, is a goal worth pursuing. Following the Soesterberg
meeting, a small working group continues to advance work on these tasks.
Progress on some, if not all, of these tasks will be observed through 1999. The 
involvement of the international GIS education community will be pursued. Ho- 
wever, while a global effort might be possible, at a minimum, general understan- 
ding of a concept of interoperable GIS education and the specification of models 
for the development of appropriate education “objects” will make sharing our 
global education resources more feasible and productive. 

10 Acknowledgements 

The authors wish to acknowledge the contribution of all the participants at the 
Soesterberg meeting whose lively discussions and thoughtful responses provided 
the basis of what is presented in this paper. Partial funding for the meeting 
and subsequent development of this paper was provided by the National Science 
Foundation. Support for the meeting was also provided by the UNIGIS Consortium,
Hewlett-Packard Netherlands and the Vrije Universiteit Amsterdam.




114 K.K. Kemp, D.E. Reeve, and D.I. Heywood 



References 

1. anonymous (1998). Industry Outlook ’99; GIS melts into IT. GEOWorld. 11: 40-49. 

2. Denning, P. (1996). Business Designs for the New University. Educom Review 31(6): 
20-30. 

3. Foote, K. E. (1997). The Geographer’s Craft: Teaching GIS in the Web.
Transactions in GIS 2(2): 137-150.

4. Goodchild, M. F., M. J. Egenhofer, R. Fegeas and C. A. Kottmann, eds. (1999). 
Interoperating Geographic Information Systems. New York, Kluwer. 

5. Heywood, D. I., S. C. Cornelius and P. H. Cremers (1998). Developing a virtual
campus for UNIGIS: an international distance learning programme for geographical
information professional. In anonymous, ed. Bringing Information Technology into
Education. Dordrecht, Kluwer.

6. Heywood, D. I., K. K. Kemp and D. E. Reeve (1999). Interoperable education
for interoperable GIS. In M. F. Goodchild, M. J. Egenhofer, R. Fegeas and C. A.
Kottmann, eds. Interoperating Geographic Information Systems. New York,
Kluwer: 443-458.

7. Kemp, K. K. (1997). The NCGIA Core Curricula in GIS and Remote Sensing. 
Transactions in GIS 2(2): 181-190. 

8. Kemp, K. K., D. E. Reeve and D. I. Heywood (1998). Report of the International 
Workshop on Interoperability for GIScience Education, IGE ’98, Soesterberg, The 
Netherlands, May 18-20, 1998. National Center for Geographic Information and 
Analysis, University of California Santa Barbara. 

9. Kemp, K. K. and D. J. Unwin (1997). Guest Editorial. From geographic
information systems to geographic information studies: An agenda for educators.
Transactions in GIS 2(2): 103-109.

10. Miller, W. 1998. Personal communication. May 1998. 




Adding an Interoperable Server Interface to a
Spatial Database: Implementation Experiences
with OpenMap™*



Charles B. Cranston¹, Frantisek Brabec¹, Gisli R. Hjaltason¹,
Douglas Nebert², and Hanan Samet¹



¹ University of Maryland, College Park, MD 20742, USA,
{zben,brabec,grh,hjs}@cs.umd.edu
² U.S. Geological Survey, Reston, VA 22092, ddnebert@usgs.gov



Abstract. Many organizations require geographic data originating from 
diverse sources in their day-to-day operations. It is often impractical to 
maintain on-site a complete database, due to issues of ownership, the 
sheer size of the data, or its dynamic nature. OpenMap™ is a distributed
mapping system that allows geographic data acquired from disparate data
sources to be displayed together. In this paper, we report our experiences
with building a “specialist” for OpenMap, allowing the OpenMap map browser
access to data stored in SAND, a prototype spatial database system. DLG
data from the U.S. Geological Survey were used to demonstrate the combined
system. Key features of the OpenMap and SAND systems are described, as well
as how they deal with the DLG data.



1 Introduction 

Geographic data is being digitized at an ever increasing rate. In many cases this 
is driven by the day-to-day needs of the collecting entities: the public utilities 
whose paper maps are becoming frayed and unusable, various ecological and 
scientific entities who use the data in furtherance of their primary missions, or 
even mundane land-ownership tracking by local governments. In other cases new 
technologies, such as space shuttle imaging radar, are making digitized spatial 
information available almost faster than it can be recorded. 

However, the Earth is very complex, far too complex to digitize in its enti- 
rety. Of necessity, it is abstractions of reality that are captured. In this process 
some details are inevitably lost. Moreover, each digitizing interest community 
has its own idea of which information it is important to retain. For example, 
given a river, the ecology community may be interested in the number of small 
frogs per kilometer of bank, a transportation agency may be more interested 
in the positions of current and future bridge crossing sites, while the Defense
Department may be more interested in knowing at what points the river can be
forded by an M1 Abrams tank.

1434HQ97SA00919, by the National Science Foundation under grant IRI-9712715,
and the Department of Energy under Contract DEFG0295ER25237.

Because of these differing views, data digitized by one organization may not 
be easily shared with other dissimilar organizations, unless some method for 
interoperation can be devised. Yet such sharing is not only desirable but manda- 
tory, as decision makers simply will not pay for essentially duplicative digitization 
efforts. Occasions arise when data simply must be shared. For example: 

An emergency response team requires the synthesis of a map that inclu- 
des geological, soil properties, road network, water lines, demographic 
information, and public service facility locations such as hospitals and 
schools to be plotted for an urban area just impacted by an earthquake. 

No single agency is responsible for this variety of geospatial data yet 
“best-available” information must be assembled and printed for use by 
field personnel in paper and electronic form within 6 hours [9]. 

The OpenGIS (Open Geographical Information Systems) Consortium [4] is an
open, industry-wide consortium of GIS vendors and users who are attempting to
facilitate interoperability by proposing standards for GIS knowledge
interchange. The goal of the consortium is to enable transparent interworking
within any one community of interest, and to provide a framework for explicit
conversion procedures when data is to be shared between differing interest
communities. The consortium was founded relatively recently, so the
standardization effort is still in its early stages. Nevertheless, a
specification of an object model for GIS data has been approved by its members.
The specification, called OpenGIS Simple Features, comes in three variants,
each tailored to a specific transport mechanism: ODBC/SQL92 (Open Database
Connectivity/Structured Query Language) [12], Microsoft OLE/COM (Object Linking
and Embedding/Component Object Model) [14], and CORBA (Common Object Request
Broker Architecture) [13].

OpenMap™ is a product suite developed by BBN Technologies (now a division of
GTE Internetworking) in a DARPA-sponsored project to demonstrate CORBA-based
mapping. Whereas OpenGIS’s Simple Features specification addresses the
interface between a GIS database and a GIS application, OpenMap™ specifies an
interface between a GIS application and its user interface (UI). OpenMap
includes a user interface client, a client/server interface (implemented
through CORBA), and a suite of specialists that implement the server side of
the interface, making a particular kind of data source accessible to the user
interface client. Thus, it provides a way to integrate geographical data from
diverse data sources in a single map display. OpenMap has been used in
technology demonstrations within the OpenGIS community, and its creators have
initiated a dialogue concerning the need for an open application/UI interface.

SAND (Spatial And Nonspatial Data) [1] is a prototype spatial database
system developed by our group. Its purpose is to be a research vehicle for work in
spatial indexing, spatial algorithms, interactive spatial query interfaces, etc. The
basic notion of SAND is to extend the traditional relational database paradigm
by allowing row attributes to be spatial objects (e.g., line segments or polygons),
and by allowing spatial indexes (like quadtrees) to be built on such attributes,
just as traditional indexes (like B-trees) are built on nonspatial attributes.

As part of a demonstration project involving OpenMap™, we built a SAND
specialist that makes data stored in SAND accessible to the OpenMap user 
interface client. In the demonstration, we used a SAND database populated 
with DLG (Digital Line Graph) data from USGS (U.S. Geological Survey). This 
paper discusses issues that arose in this work. 

The rest of the paper is organized as follows. Section 2 discusses OpenMap™
in more detail. Section 3 describes the SAND database system. Section 4 presents
the SAND specialist for OpenMap, while conclusions are drawn in Sect. 5. For 
the interested reader, we have included in an Appendix a discussion of the issues 
arising in converting DLG data to SAND’s native format. 

2 OpenMap™ and OpenGIS

A GIS system can be viewed as being divided into three tiers (see Fig. 1), the 
UI (User Interface), Application, and Database tiers. In the Database tier we 
have databases storing actual GIS data, in the Application tier we have appli- 
cations that query the databases and process the result in some manner, and 
in the UI tier we have the graphical user interface where the query result is 
displayed to the user. The OpenGIS Simple Features specification addresses the 
interface between the Application and Database tiers (as well as between ap- 
plications). OpenMap, on the other hand, specifies an interface between the UI 
and Application tiers. This interface is based on CORBA (Common Object Request
Broker Architecture) [11], an industry-standard middleware layer based on the
remote-object-invocation paradigm. By middleware we mean shared software layers
that support communication between applications, thereby, it is hoped, achieving
platform independence. Such a “layering” organizational paradigm has been
extremely successful in networked computer communications (for example, FTP
over TCP over IP over Ethernet). The recent adoption of IIOP (Internet
Inter-ORB Protocol) standardizes CORBA interoperation down to the TCP/IP
protocol layer. Thus any two CORBA applications should be able to interwork.

The central component of OpenMap is the OpenMap Browser, its user in- 
terface client. It includes a map viewing area, navigation controls, and a layers 
palette, in addition to menus and a tool bar. A simplified version of the Open- 
Map Browser was implemented in Java (see Fig. 2), and a variant of it can be 
deployed on any Java-enabled Web browser. The layers palette lists map layers
available to the client. A map layer is a collection of related geographic
objects, e.g., a road network, railroad tracks or country boundaries. The layers
come from data servers, termed specialists, that communicate with the OpenMap
Browser using CORBA. The interface specification between specialists and the
OpenMap Browser allows the Browser to request data objects intersecting a query
rectangle, where the data objects are graphical objects of various types,
including line segments, circles, rectangles, polylines/polygons, raster images,
and text. These can be specified either in lat/long coordinates or in screen
coordinates. In addition, the interface provides support for custom palettes
that allow the user to configure the specialist, and support for gestures, which
allow the specialist to respond to mouse actions on the displayed graphics.

Fig. 1. A three-tier view of a distributed GIS system

Specialists can be implemented either in C++ or Java. Among the components
of OpenMap are classes for each language (called CorbaSpecialist and
Specialist, respectively) that encapsulate the common aspects of all specialists [2].
A custom data server can be created by extending these classes, adding only 
the specialized routines required to access a particular target database. Details 
of CORBA and session initialization, transfer of query rectangles from client to 
server, and transfer of GIS feature information from server to client are handled 
transparently. 
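
The division of labour described above can be illustrated with a minimal sketch. It is written in Python purely for brevity and does not reproduce the OpenMap API: the names BaseSpecialist, handle_request and fetch_features are hypothetical, and the CORBA transport is replaced by an ordinary method call. The base class holds the shared request handling, and a subclass supplies only the data-source-specific lookup.

```python
from abc import ABC, abstractmethod


class BaseSpecialist(ABC):
    """Hypothetical stand-in for the shared specialist base class."""

    def handle_request(self, left, bottom, width, height):
        # Common part: validate the query rectangle and collect the graphics.
        # (A real specialist would marshal them to the client via CORBA.)
        if width < 0 or height < 0:
            raise ValueError("query rectangle must have non-negative size")
        return list(self.fetch_features(left, bottom, width, height))

    @abstractmethod
    def fetch_features(self, left, bottom, width, height):
        """Data-source-specific part supplied by each subclass."""


class InMemoryPointSpecialist(BaseSpecialist):
    """Toy data source: a list of (x, y) points held in memory."""

    def __init__(self, points):
        self.points = points

    def fetch_features(self, left, bottom, width, height):
        for x, y in self.points:
            if left <= x <= left + width and bottom <= y <= bottom + height:
                yield ("point", x, y)
```

Under these assumptions, a new data source needs only its own fetch_features routine; the request handling is inherited unchanged.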



3 SAND 

SAND [6,7], the spatial database system developed by our research group, is
divided into two main layers, the SAND kernel and the SAND interpreter. The
SAND kernel was built in an object-oriented fashion (using C++) and comprises
a collection of classes (i.e., object types) and class hierarchies that
encapsulate the various components. Since SAND adopts a data model inspired by
the relational model, its core functionality is defined by the different types
of tables and attributes it supports. Thus, the table and attribute class
hierarchies are among the most important. The SAND interpreter provides a
low-level procedural query interface to the functionality defined by the SAND
kernel. Using the query interface provided by the SAND interpreter, we have
built a number of useful tools. For example, the SAND Browser is an interactive
spatial query browser that allows the user to pose queries through graphical
input. Also, we have built a prototype for a high-level declarative query
interface to SAND, modeled on SQL.

3.1 Table Types 

The table abstraction in SAND encapsulates what in conventional databases are 
known as relations and indexes. Tables are handled in much the same way as 
regular disk files, i.e., they have to be opened so that input and output to disk 
storage can take place. All open tables in SAND respond to a common set of
operations, such as first, next, insert, and delete.

SAND currently defines three table types: relations, linear indexes and spatial
indexes. Each table type supports an additional set of operations, specific to its
functionality. The function of most of these operations is to alter the order in
which tuples are retrieved, i.e., the behavior of first and next.

Relations in SAND are tables which support direct access by tuple identifier 
(tid). Ordinarily, tuples are retrieved in order of increasing tid, but the operation 
goto tid can be used to jump to the tuple associated with the given tid (if it 
exists). 
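
The cursor-style table protocol described above can be sketched as follows. This is an illustrative Python model, not SAND code: the class name Relation and the method spellings (goto_tid, status, get) are assumptions chosen to echo the operations named in the text, and a dictionary stands in for disk storage.

```python
class Relation:
    """Toy relation: tuples keyed by tuple identifier (tid)."""

    def __init__(self):
        self._tuples = {}      # tid -> tuple value
        self._next_tid = 0
        self._current = None   # tid of the current tuple, or None

    def insert(self, value):
        tid = self._next_tid
        self._tuples[tid] = value
        self._next_tid += 1
        return tid

    def delete(self, tid):
        self._tuples.pop(tid, None)
        if self._current == tid:
            self._current = None

    def first(self):
        # Position the cursor on the tuple with the smallest tid.
        self._current = min(self._tuples, default=None)

    def next(self):
        # Advance to the next larger tid, or to "no tuple" at the end.
        if self._current is None:
            return
        later = [t for t in self._tuples if t > self._current]
        self._current = min(later, default=None)

    def goto_tid(self, tid):
        # Direct access by tid; the cursor becomes invalid if tid is absent.
        self._current = tid if tid in self._tuples else None

    def status(self):
        return self._current is not None

    def get(self):
        return self._tuples[self._current]
```

A scan is then the loop "first; while status: get; next", which is the same shape as the Tcl script shown in Sect. 4.1.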

Linear indexes for non-spatial attributes are implemented using B-trees [5]. 
Tuples in a linear index are always scanned in an order determined by a total 
ordering function. Linear indexes support the find operator, which retrieves from 
the disk storage the tuple that most closely matches a tuple value given as an 
argument. The find operator can also be used to perform range searches. 
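
How find can double as a range search is easiest to see in a small model. The sketch below is written in Python with a sorted list standing in for the B-tree; LinearIndex and range_search are hypothetical names. It also assumes that "most closely matches" means the first key greater than or equal to the argument, which the text does not state explicitly.

```python
import bisect


class LinearIndex:
    """Toy ordered index; a sorted list stands in for the B-tree."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._pos = len(self._keys)   # past the end means "no current tuple"

    def find(self, key):
        # Assumption: "closest match" is the first key >= the argument.
        self._pos = bisect.bisect_left(self._keys, key)

    def next(self):
        self._pos += 1

    def status(self):
        return self._pos < len(self._keys)

    def get(self):
        return self._keys[self._pos]


def range_search(index, low, high):
    """Return all keys k with low <= k <= high, using find followed by next."""
    result = []
    index.find(low)
    while index.status() and index.get() <= high:
        result.append(index.get())
        index.next()
    return result
```

Because the index is scanned in key order, the range search can stop at the first key beyond the upper bound instead of examining every tuple.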

Spatial indexes are implemented using PMR-quadtrees [10,16,17]. They sup- 
port a variety of spatial search operators, such as intersect for searching tuples 
that intersect a given feature, or within for retrieving tuples in the proximity 
of a given feature. Spatial indexes also support ranking [8], a special kind of 
search operator whereby tuples are retrieved in order of distance from a given 
feature. 
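
The idea behind ranking, that results are delivered one at a time in order of distance, can be illustrated with a short Python generator. This sketch does not reproduce the quadtree-based algorithm of [8]: a binary heap over all features stands in for the index traversal, the features are restricted to points, and rank_by_distance is a hypothetical name.

```python
import heapq
import math


def rank_by_distance(points, query):
    """Yield (distance, point) pairs in order of increasing distance."""
    qx, qy = query
    heap = [(math.hypot(x - qx, y - qy), (x, y)) for x, y in points]
    heapq.heapify(heap)              # linear-time setup; no full sort is done
    while heap:
        yield heapq.heappop(heap)    # each further result costs O(log n)
```

A caller that needs only the nearest few features can stop consuming the generator early, which is the property that makes ranking attractive for interactive queries.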

3.2 Attribute Types 

SAND implements attributes of common non-spatial types (integer and
floating-point numbers, fixed-length and variable-length strings) as well as
two-dimensional and three-dimensional geometric types (points, line segments,
axis-aligned rectangles, polygons and regions). All attribute types support a common
set of operations to convert their values to and from text, to copy values bet- 
ween attributes of compatible types, as well as to compare values for equality. 
Non-spatial attribute types also support the compare operator, which is used 
to establish a total ordering between values of the same type. This is required so 
that non-spatial attributes can be used as keys in linear indexes. Spatial attri- 
bute types support a variety of geometric operations, including intersect which 
tests whether two features intersect, distance which returns the Euclidean di- 
stance between two features (used for the ranking operator), and bbox which 
returns the smallest axis-aligned rectangle that contains a given feature (i.e., 
its minimum bounding rectangle). Some spatial types support additional ope- 
rations. For instance, the region type supports operations like expand, which 
can be used to perform morphological operations such as contraction and ex- 
pansion, and transform, which can be used in the computation of set-theoretic 
operations. 
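
For concreteness, the three geometric operations named above can be sketched for the simplest case of line segments, points and axis-aligned rectangles. The Python functions below are illustrative only and are not taken from SAND; the tuple layouts, the choice of point-to-segment distance, and the use of Liang-Barsky clipping for the intersection test are all assumptions.

```python
import math


def bbox(segment):
    """Minimum bounding rectangle of a segment as (xmin, ymin, xmax, ymax)."""
    (x1, y1), (x2, y2) = segment
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def distance(point, segment):
    """Euclidean distance from a point to the closest point of a segment."""
    px, py = point
    (x1, y1), (x2, y2) = segment
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(px - x1, py - y1)
    # Project the point onto the supporting line and clamp to the segment.
    t = ((px - x1) * dx + (py - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def intersect(segment, rect):
    """True if the segment touches the rectangle (xmin, ymin, xmax, ymax)."""
    (x1, y1), (x2, y2) = segment
    xmin, ymin, xmax, ymax = rect
    dx, dy = x2 - x1, y2 - y1
    t0, t1 = 0.0, 1.0   # parameter interval of the segment still inside
    edges = ((-dx, x1 - xmin), (dx, xmax - x1), (-dy, y1 - ymin), (dy, ymax - y1))
    for p, q in edges:
        if p == 0:
            if q < 0:
                return False      # parallel to this edge and outside it
        else:
            t = q / p
            if p < 0:
                t0 = max(t0, t)   # segment enters across this edge
            else:
                t1 = min(t1, t)   # segment leaves across this edge
            if t0 > t1:
                return False
    return True
```

The intersect function here is the kind of predicate a rectangle window query has to evaluate for each candidate line segment.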

3.3 The SAND Interpreter 

The SAND kernel provides the basic functionality needed for storing and
processing spatial and non-spatial data. In order to access the functionality of
this kernel in a flexible way, we opted to provide an interface to it by means
of an interpreted scripting language, Tcl [15]. Tcl offers the benefits of an
interpreted language but still allows code written in a high-level compiled
language (in our case, C++) to be incorporated via a very simple interface
mechanism. Another advantage offered by Tcl is that it provides a seamless
interface with Tk [15], a toolkit for developing graphical user interfaces.

The SAND interpreter provides commands that mirror all kernel operations 
mentioned in the previous section. In some cases, a single command may cause 
more than one kernel operation to be performed. In addition, the interpreter im- 
plements data definition facilities. The processing of spatial queries is supported 
by interpreter commands associated with spatial attributes, spatial indexes and 
bounding structures. 

4 SAND Specialist for OpenMap™

In this section we describe a specialist for OpenMap that provides access to 
geographic data stored in SAND relations. We implemented this specialist in 
Java, and thus it is based on the Specialist class provided by BBN. Figure 3 shows 
the software components of an OpenMap session, where the structure of the 
SAND specialist is detailed. The user interface client uses CORBA middleware to 
communicate with various specialists, each of which provides access to a specific 
type of data source. The SAND specialist code communicates with the UI client 
with methods inherited from the Specialist class, and in turn invokes the SAND 
interpreter to perform the actual data access. The SAND specialist responds 
to requests for objects in a particular map layer intersecting a query rectangle. 
(In this case, each map layer corresponds to a SAND relation.) In addition, the 
SAND specialist directs the UI client to display a custom palette for each layer, 
where the color of the data objects in the layer can be set. 




Fig. 3. Structure of the SAND specialist for OpenMap™

The data made available by the SAND specialist in our demonstration was
obtained from USGS and is in the form of points, polylines and polygonal areas
(see the Appendix for details of how this data was imported into SAND). Since
the goal of the demonstration was to show how multiple maps from diverse sources
could be overlaid on top of each other, it was undesirable to display filled
polygons, as they obliterate any map features in lower layers. Thus, we chose to
focus on polylines and polygon boundaries, stored as line segments in the SAND
relations representing each map layer. (An alternative approach would have been
to represent polygonal areas with the SAND polygon attribute type, and convert
from polygons to polylines or line segments in the server at run time.) A spatial
index was built on the line segment attribute in the map layer relations in order
to allow efficient spatial lookup.



4.1 Implementation of the SAND Specialist 

The SAND specialist communicates with the SAND interpreter by passing it Tcl 
scripts that implement specific database queries using the low-level SAND query 
interface, and receiving back textual output through an I/O pipe. GIS features 
satisfying the query are translated into OpenMap Specialist objects, which are 
then passed on to inherited Specialist methods for transport to the client display. 
The Tcl script that the SAND specialist passes to the SAND interpreter selects 
a database, opens a spatial index, and then executes a query passing a rectangle 
as an argument (the only query currently specified in the OpenMap specialist 
interface is such a rectangle intersection query): 

sand cd <directory>;
set index [sand open <indexname>];
$index first -intersect \
    {rectangle <left> <bottom> <width> <height>};
while {[$index status]} {
    puts [$index get];
    $index next;
}
$index close

The <directory> argument to the sand cd command specifies the file system 
directory that holds the database. The sand open command returns a handle 
to an open table (see Sect. 3.1), in this case an index on a spatial attribute. 
The handle, which corresponds to an underlying C++ spatial index object, can
subsequently be used to perform actions on the index. The script initiates a 
spatial window query by invoking the table command first with a query rec- 
tangle, which loads the first tuple satisfying the query (if any exists), i.e., a tuple 
corresponding to a line segment that is intersected by the given rectangle. The 
table command get returns the contents of the current index tuple, in this case 
storing a line segment. The table command next loads the next tuple satisfying 
the query, or sets the table status to false if none exists. In fact, not only does
the SAND kernel return the tuples satisfying the query one by one (through the
SAND interpreter), but it also executes the query incrementally rather than in
batch style. The while loop outputs the line segment (which is received by the
SAND specialist), and fetches another one until all line segments intersecting
the query rectangle have been exhausted. At that point, the index file is closed.

This query plan, which makes use of a spatial index, is more efficient than the 
straightforward one of initiating a sequential scan of a relation, then testing each 
line segment for intersection with the query rectangle and only outputting those 
segments that actually intersect it. However, as it turned out in our experiment, 
the time saved was actually drowned out by communication costs. (We will 
discuss this at greater length in Sect. 5.) 

The line segments as returned over the I/O pipe are of the form: 

{line <x1> <y1> <x2> <y2>}

The SAND specialist parses this format, creates two OpenMap Specialist points, 
creates a Specialist line between the two points (using either black or a color 
determined by a control subpanel), and adds the line to the display list for return 
to the client. 
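
The parsing step can be sketched as follows. The actual specialist was written in Java; this Python version is only an illustration of the record format given above, and parse_line_record is a hypothetical name. It assumes one record per line of pipe output and whitespace-separated decimal coordinates.

```python
import re

# One record per line of pipe output, e.g. "{line 10.5 20 30 40.25}".
_LINE_RE = re.compile(r"^\{line\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\}$")


def parse_line_record(text):
    """Return ((x1, y1), (x2, y2)) parsed from one line record."""
    match = _LINE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a line record: {text!r}")
    # float() raises ValueError itself if a field is not a number.
    x1, y1, x2, y2 = (float(v) for v in match.groups())
    return (x1, y1), (x2, y2)
```

Rejecting malformed records explicitly, rather than skipping them silently, makes failures in the pipe visible instead of producing an empty display.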

Each coordinate value undergoes four data conversions. The data in the index
file is in binary floating-point format. The SAND interpreter converts this to
an ASCII string representation (conversion 1) to return it over the I/O pipe to
the SAND specialist. The SAND specialist reads the ASCII string representation
from the pipe and converts it back to binary floating point (conversion 2). It must do
this because Specialist’s Point object creator takes binary floating arguments. 
Presumably Specialist must convert this machine-specific binary floating value 
into some machine independent wire format (conversion 3). Finally, the display 
client must convert from the wire format to some display-device specific format 
(conversion 4) for eventual display. 
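
The chain of conversions can be traced for a single coordinate with a short Python sketch. Several details are assumptions made only for illustration and are not stated in the text: 8-byte IEEE doubles in the index file, a big-endian double as the wire format, and an integer screen coordinate obtained by a fixed scale factor. The function name conversion_chain is hypothetical.

```python
import struct


def conversion_chain(stored_bytes, scale=100.0):
    """Trace one coordinate through the four conversions described above."""
    (value,) = struct.unpack("<d", stored_bytes)   # value as held in the index file
    text = repr(value)                 # 1: binary -> ASCII (interpreter side)
    parsed = float(text)               # 2: ASCII -> binary (specialist side)
    wire = struct.pack(">d", parsed)   # 3: binary -> assumed big-endian wire format
    (received,) = struct.unpack(">d", wire)
    pixel = round(received * scale)    # 4: wire -> assumed integer screen coordinate
    return text, parsed, wire, pixel
```

In this sketch the text round trip is lossless because repr emits enough digits to recover the same double; a formatter that prints fewer digits would lose precision at conversion 1.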

The Java source file for SAND specialist contains 275 lines of code. Of this, 
roughly 200 implement the basic server and another 50 implement the palette
for controlling the color of the line segments in a SAND specialist map layer. The
rest are comments and housekeeping “import” statements. 

4.2 Sample OpenMap™ Session with the SAND Specialist

Figure 2 shows the OpenMap UI client displaying three map layers obtained from
the SAND specialist: Hydrography (i.e., rivers, lakes, etc.), Boston Roads, and
Railroads. The window on the lower right is a control panel, where the user can 
select the layers to display as well as pan and zoom on the map display. The layer 
list on the left side of the control panel has two check boxes for each available 
layer. The right-hand one selects the layer for display, whereas the left-hand one 
causes the creation of a custom palette specific to the layer. The Hydrography
and Railroads layers each have a palette that allows setting the color of their line 
segments (the two windows on the lower left). The color for the Hydrography 
layer is set to blue, and the color for the Railroads layer is set to red. The color of 
the Roads layer is set to the default color, black, since it does not have a palette 
associated with it. When the user clicks on the “Redraw” button, a query is sent
to each selected server for any geographic objects visible in the display area.

5 Concluding Remarks

Our research agenda includes developing efficient ways to access spatial data 
using advanced indexing structures and thus one of our goals for the demon- 
stration software was to showcase the speed advantage these make possible com- 
pared to the simple sequential scan paradigm commonly used in industry. This 
desire caused us to prematurely optimize our software, which became a pro- 
blem during development. For example, we spent too much time worrying about 
cross-language linkage delay, the delay caused by using an I/O pipe to commu- 
nicate with the SAND interpreter (as opposed to assembling all the code into 
one binary), the delay caused by our need to translate each data point between
binary floating point and string representation four times, etc. 

When the demonstration was finally assembled, we found that the speed 
advantage gained from our indexed access was completely swamped by the delays 
in other software components and by the communication cost. In addition, there 
is an inherent conflict between the incremental nature of our software and the 
batch nature of other software components in the demo. SAND was designed to 
be incremental; data are typically released from the query as soon as they become 
available. On the other hand, the Specialist class (which implements the CORBA 
communications) seemed to gather the complete data set before transmitting any 
data to the client, and in fact seemed to do significant processing before doing 
so. As an indication of this difference, for a typical query result size of a few 
thousand line segments, SAND would start returning data within five seconds 
and complete the query within 30 seconds. However, the Specialist class might 
then spend over two minutes marshalling the results before the client could show 
the user any feedback to the user’s query. 
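
The contrast between incremental and batch delivery can be made concrete with a small sketch. The Python below is purely illustrative and is not the SAND or Specialist code; `CountingSource`, `first_result_incremental` and `first_result_batch` are hypothetical names. It counts how many segments the data source must produce before the client receives its first result.

```python
from typing import Iterator, List, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class CountingSource:
    """Stand-in for a query engine; counts the segments it has produced."""

    def __init__(self, segments: List[Segment]) -> None:
        self.segments = segments
        self.produced = 0

    def scan(self) -> Iterator[Segment]:
        # Incremental: each segment is released as soon as it is available.
        for seg in self.segments:
            self.produced += 1
            yield seg


def first_result_incremental(source: CountingSource) -> Segment:
    """The client sees the first segment after one unit of work."""
    return next(source.scan())


def first_result_batch(source: CountingSource) -> Segment:
    """The whole result set is gathered before anything is handed on."""
    return list(source.scan())[0]
```

With a result of a few thousand segments, the incremental consumer has its first segment after one step, whereas the batch consumer waits for the full scan; this mirrors, in miniature, the delay observed between SAND and the Specialist class.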

The design decisions we faced in implementing the SAND Specialist were 
largely those common to most software development projects. We had to decide 
on an implementation language (Java or C++; we chose Java), and how our 
code would couple to existing systems (i.e., loosely or tightly coupled; we chose 
a loosely coupled approach). We needed to develop a data philosophy that would 
support the desired results, and to make that philosophy work with the data we 
could easily obtain. In retrospect, many of the problems we faced were due to 
premature optimization and a failure to test error paths early enough (i.e., errors 
occurring in various parts of the system weren’t apparent, the only visible result 
being that no data was displayed). 

The work described here can be extended in a number of ways. Some direc- 
tions for future work include the following: 

— Currently the SAND specialist invokes the SAND interpreter separately for 
each query, thereby incurring the overhead of opening and closing an I/O 
pipe. In order to allow the I/O pipe between the SAND specialist and the 
SAND interpreter to remain open between queries, we need to invent some 
explicit synchronization scheme. 

— At this time, the non-spatial attributes of the SAND relations are not used 
by the SAND specialist. We plan to extend the SAND specialist to make 
use of these attributes by modifying the display of the spatial features. This 
could be as simple as feature coloring, or as complicated as automatic legend 
placement. 

— The OpenMap specialist interface supports gesturing thereby allowing the 
specialist to act on user interface actions. A simple example might be to 
display (in textual form) the non-spatial attributes of the database tuple 
closest to a mouse click. 

— Extensions to deal with areal (i.e., polygon) data such as choropleths. 

— Multi-resolution display: a road drawn as a line segment on a low-resolution 
display might be better represented at higher resolutions as a long thin 
rectangle or polygon. An intelligent specialist could make this translation. 

— The addition of other interoperable interfaces to SAND, e.g., the OpenGIS 
SimpleFeatures interface. We expect that most of the lessons learned in implementing 
the OpenMap™ server will be directly transferable to work on 
implementing other interfaces. One such lesson is that one should take a careful 
look at network communication costs before expending too much effort 
in optimizing the local query execution of the database server. 
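
The first item above calls for an explicit synchronization scheme so that the pipe to the interpreter can stay open between queries. One possible scheme, sketched here under assumptions, is for the interpreter to terminate every reply with a marker line. The marker `#END` and the helper names `send_query` and `read_reply` are hypothetical and are not part of SAND or OpenMap.

```python
import io
from typing import List, TextIO

SENTINEL = "#END"  # hypothetical end-of-reply marker agreed by both sides


def send_query(pipe_in: TextIO, query: str) -> None:
    """Write one query as a single line and flush so the peer sees it."""
    pipe_in.write(query.rstrip("\n") + "\n")
    pipe_in.flush()


def read_reply(pipe_out: TextIO, sentinel: str = SENTINEL) -> List[str]:
    """Read lines until the sentinel; the pipe stays open for the next query."""
    lines: List[str] = []
    while True:
        raw = pipe_out.readline()
        if raw == "":  # end of file: the peer closed the pipe
            raise EOFError("pipe closed before the end-of-reply marker")
        line = raw.rstrip("\n")
        if line == sentinel:
            return lines
        lines.append(line)


# Two replies arriving back to back on the same open stream.
stream = io.StringIO("seg 0 0 1 1\nseg 1 1 2 2\n#END\nseg 5 5 6 6\n#END\n")
first = read_reply(stream)
second = read_reply(stream)
```

The same two functions would work on the text-mode `stdin` and `stdout` streams of a long-lived child process; a real deployment would also need to ensure that the marker can never occur inside a legitimate reply line.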



References 

1. W. G. Aref and H. Samet. Extending a DBMS with spatial operations. In 
O. Günther and H.-J. Schek, editors, Advances in Spatial Databases — Second 
Symposium, SSD’91, pages 299-318, Zurich, Switzerland, August 1991. (Also 
Springer-Verlag Lecture Notes in Computer Science 525). 

2. BBN Corporation. Designing CORBA (Orbix/VisiBroker) Specialists for 
BBN’s OpenMap, 1997. Available as 
http://javamap.bbn.com/projects/matt/development/specialist.html on the web. 

3. K. J. Boyko, M. A. Domaratz, R. G. Fegeas, H. J. Rossmeissl, and E. L. Usery. An 
enhanced digital line graph design. U. S. Geological Survey Circular 1048, 1990. 
(Also see http://edcwww.cr.usgs.gov/glis/hyper/guide/usgs_dlg). 

4. K. Buehler and L. McKee, editors. The OpenGIS Guide — Introduction to 
Interoperable Geo-Processing, Wayland, MA, 1996. OpenGIS Consortium. OGIS TC 
Document 96-001, available as http://www.opengis.org/guide on the web. 

5. D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, June 
1979. 

6. C. Esperança and H. Samet. Spatial database programming using SAND. In M. J. 
Kraak and M. Molenaar, editors, Proceedings of the Seventh International Symposium 
on Spatial Data Handling, volume 2, pages A29-A42, Delft, The Netherlands, 
August 1996. International Geographical Union Commission on Geographic 
Information Systems, Association for Geographical Information. 

7. C. Esperança and H. Samet. An overview of the SAND spatial database system. 
Communications of the ACM, to appear. 

8. G. R. Hjaltason and H. Samet. Ranking in spatial databases. In M. J. Egenhofer 
and J. R. Herring, editors, Advances in Spatial Databases — Fourth International 
Symposium, SSD’95, pages 83-95, Portland, ME, August 1995. (Also 
Springer-Verlag Lecture Notes in Computer Science 951). 

9. Doug Nebert. WWW mapping in a distributed environment: Scenario of visualizing 
mixed remote data, 1997. Available as 
http://www.fgdc.gov/publications/documents/clearinghouse/wwwmap_scenario.html 
on the web. 







10. R. C. Nelson and H. Samet. A consistent hierarchical representation for vector 
data. Computer Graphics, 20(4):197-206, August 1986. (Also Proceedings of the 
SIGGRAPH’86 Conference, Dallas, August 1986). 

11. Object Management Group. CORBA 2.0/IIOP Specification, 1997. OMG formal 
document 97-09-01, available as http://www.omg.org/corba/c2indx.htm on the 
web. 

12. Open GIS Consortium, Inc. OpenGIS Simple Features Specification 
for SQL Revision 1.0, March 1998. Available as 
http://www.opengis.org/public/sfr1/sfsql_rev_1_0.pdf on the web. 

13. Open GIS Consortium, Inc. OpenGIS Simple Features Specification 
for CORBA Revision 1.0, March 1998. Available as 
http://www.opengis.org/public/sfr1/sfcorba_rev_1_0.pdf on the web. 

14. Open GIS Consortium, Inc. OpenGIS Simple Features Specification 
for OLE/COM Revision 1.0, March 1998. Available as 
http://www.opengis.org/public/sfr1/sfcom_rev_1_0.pdf on the web. 

15. J. K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley, 1994. 

16. H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image 
Processing, and GIS. Addison-Wesley, Reading, MA, 1990. 

17. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 
Reading, MA, 1990. 



Appendix: Data Conversion - DLG to SAND 

As the source for data in our demonstration we used data sets from USGS (U.S. 
Geological Survey), encoded in the DLG (Digital Line Graph) format [3]. In this 
section, we briefly describe the DLG format, and present issues that arose in the 
conversion of DLG data to a format readable by SAND. 

Prior to describing the DLG format we point out that a DLG map of the 
same geographic area is divided into several layers. Each layer is represented in 
a separate DLG file: 

— Hydrography: flowing water, standing water, and wetlands. 

— Roads and trails. 

— Railroads. 

— Pipelines, transmission lines, and miscellaneous transportation. 

— Hypsography: contours and supplementary spot elevations. 

— Boundaries: state, county, city, and other national and state lands such as 
forests and parks. 

— Public Land Survey System including township, range, and section information. 

Typically each of these layers is displayed in a different color to make them 
easier to tell apart. Since the DLG files are distributed in datasets covering 
a rather small region (0.5° square for 1:100,000-scale DLG maps), we merged 
several datasets together to represent a larger area. 

The DLG format encodes information about geographic features, in the form 
of points, polylines and polygonal areas, together with associated non-spatial 
information. It has primarily been used to encode printed cartographic maps into 
digital form. As the name suggests, it is assumed that the map being represented 
forms a planar graph. Thus, DLG files are composed of node, line and area 
identifier elements. A single node is defined by its coordinates and may mark 
the start or the end of one or more lines or a point feature. Therefore nodes occur 
where lines intersect and at places on linear features where they are subdivided 
into separate line segments. A line (corresponding to an edge in the graph) is 
defined as sequences of line segments, with a node anchored at each end. Lines do 
not cross over themselves or any other lines in the map. An area is a contiguous 
region of a map bounded by lines. It is defined by a sequence of line references. 

Along with their spatial information, nodes, lines and areas can carry feature 
codes^. These are numerical codes, used to describe the physical and cultural 
characteristics of the corresponding geographic features. For example, a feature 
code for an area might identify it to be a lake or swamp, and a feature code for 
a line might identify a road, railroad, stream, or shoreline. Features in DLG can 
have any number of feature codes. 
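
The node, line and area elements described above can be pictured with a few record types. This is a minimal sketch for illustration only: the class and field names are hypothetical, feature codes are simplified to plain integers, and nothing here reproduces the actual DLG file layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    node_id: int
    position: Point
    feature_codes: List[int] = field(default_factory=list)


@dataclass
class Line:
    """An edge of the planar graph: a polyline anchored by a node at each end."""
    line_id: int
    start_node: int
    end_node: int
    vertices: List[Point]  # includes both end points
    feature_codes: List[int] = field(default_factory=list)

    def segments(self) -> List[Tuple[Point, Point]]:
        # Consecutive vertex pairs form the individual line segments.
        return list(zip(self.vertices, self.vertices[1:]))


@dataclass
class Area:
    """A region bounded by lines, stored as a sequence of line references."""
    area_id: int
    line_refs: List[int]
    feature_codes: List[int] = field(default_factory=list)


stream = Line(1, start_node=10, end_node=11,
              vertices=[(0.0, 0.0), (1.0, 0.5), (2.0, 0.0)])
```

Here `stream.segments()` yields two segments, which is the granularity at which line data can be handed to a segment-oriented spatial index.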

In order to make the data stored in DLG files usable by our SAND system, 
we had to convert the DLG data to SAND’s native format, which consists of a 
set of relations. Each relation in a SAND database contains an arbitrary, but 
fixed, number of attributes. Usually, only one of these attributes is of a spatial 
attribute type, but this is not a requirement. For our purposes we decided to 
focus on the line data provided by the DLG files. Nodes are mostly used to 
define line segments, and the spatial objects represented by a node type (e.g., 
wells, tunnel portals) were found to be of limited interest. The areas stored in 
DLG files are defined as a sequence of lines so each area was exported to the 
SAND database as such, but provisions were made to indicate that a set of lines 
originally defined a single area. 

^ The usual term for the codes is “attributes”. However, since they are a very different 
concept from “attributes” as used in SAND (to mean fields in tables), we use the 
alternative term “feature codes” to avoid confusion. 

Although not currently used in the SAND specialist, we represented the feature 
codes of each feature in non-spatial attributes in the corresponding relation. 
We faced the dilemma that in DLG, a feature can have any number of feature 
codes, whereas the number of attributes in tuples of a given relation is fixed. To 
solve this, we divided the set of feature codes into three classes: primary, secondary, 
and the rest. Primary and secondary feature codes are meant to represent 
the most important characteristics of a feature. They are chosen in such a way as 
to make it very likely that there will be at most one primary and one secondary 
feature code for each feature. For example, if the feature is a road, the primary 
feature code would specify the type of the road (primary route, trail, footbridge, 
etc.), the secondary feature code would provide additional information (number 
of lanes, interstate route number, county route, etc.), and all other feature codes 
(in tunnel, on bridge, private, etc.) would be stored together in a third attribute 
of the relation tuple storing the feature. Unfortunately, this strategy sometimes 
fails, i.e., a feature can have more than one primary or secondary feature code; 
for instance, a certain road segment could be part of several different interstate 
routes. In this case, only one of the feature codes is stored as the primary or 
secondary one. 
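
The three-way split of feature codes can be sketched as follows. The function name, the code values and the membership sets are all hypothetical. The text does not say what happens to a surplus primary or secondary code, so this sketch assumes that the first one encountered wins and that any surplus is kept with the remaining codes.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class FeatureTuple(NamedTuple):
    primary: Optional[int]
    secondary: Optional[int]
    others: List[int]


def split_feature_codes(codes: Iterable[int],
                        primary_class: Set[int],
                        secondary_class: Set[int]) -> FeatureTuple:
    """Fit an arbitrary number of codes into three fixed attributes."""
    primary: Optional[int] = None
    secondary: Optional[int] = None
    others: List[int] = []
    for code in codes:
        if code in primary_class and primary is None:
            primary = code
        elif code in secondary_class and secondary is None:
            secondary = code
        else:
            # Assumption: surplus primary/secondary codes are not discarded.
            others.append(code)
    return FeatureTuple(primary, secondary, others)


# Invented code values: 201 = road type, 301 and 302 = two route numbers.
road = split_feature_codes([201, 301, 302, 601],
                           primary_class={201, 202},
                           secondary_class={301, 302})
```

In this example the second route number ends up among the remaining codes, which illustrates the failure case mentioned above where a road segment belongs to several routes.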

Another issue that had to be resolved was the conversion of coordinates 
used to define the locations of spatial features. In our demonstration we used 
DLG files digitized from maps of scale 1:100,000; such DLG files define the 
location of objects with respect to the Universal Transverse Mercator (UTM) 
Projection, which is a map projection that preserves angular relationships and 
scale. However, OpenMap (and some of the spatial functions in SAND) assumes 
that spatial objects are specified using latitude and longitude coordinates. Thus 
we had to convert each DLG coordinate pair into latitude/longitude coordinates. 
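
The conversion of a UTM coordinate pair to latitude and longitude can be sketched with the standard series formulas for the inverse transverse Mercator projection on an ellipsoid. This is an illustrative implementation, not the conversion code used in the demonstration. The function name is hypothetical, and the default ellipsoid (WGS 84) is an assumption, since the text does not state the datum of the DLG files; other ellipsoids can be passed through `a` and `e2`.

```python
import math

K0 = 0.9996                       # UTM scale factor on the central meridian
FALSE_EASTING = 500000.0          # metres
FALSE_NORTHING_SOUTH = 10000000.0  # metres, southern hemisphere only


def utm_to_latlon(easting, northing, zone, northern=True,
                  a=6378137.0, e2=0.00669437999014):
    """Return (latitude, longitude) in degrees for a UTM coordinate pair.

    a is the semi-major axis in metres, e2 the squared eccentricity
    (defaults are WGS 84, which is an assumption of this sketch).
    """
    x = easting - FALSE_EASTING
    y = northing if northern else northing - FALSE_NORTHING_SOUTH
    lon0 = math.radians(zone * 6 - 183)   # central meridian of the zone
    ep2 = e2 / (1.0 - e2)                 # second eccentricity squared

    # Footpoint latitude from the meridional arc length.
    m = y / K0
    mu = m / (a * (1 - e2 / 4 - 3 * e2**2 / 64 - 5 * e2**3 / 256))
    s = math.sqrt(1 - e2)
    e1 = (1 - s) / (1 + s)
    phi1 = (mu
            + (3 * e1 / 2 - 27 * e1**3 / 32) * math.sin(2 * mu)
            + (21 * e1**2 / 16 - 55 * e1**4 / 32) * math.sin(4 * mu)
            + (151 * e1**3 / 96) * math.sin(6 * mu)
            + (1097 * e1**4 / 512) * math.sin(8 * mu))

    sin1, cos1, tan1 = math.sin(phi1), math.cos(phi1), math.tan(phi1)
    c1 = ep2 * cos1**2
    t1 = tan1**2
    w = 1 - e2 * sin1**2
    n1 = a / math.sqrt(w)            # radius of curvature, prime vertical
    r1 = a * (1 - e2) / w**1.5       # radius of curvature, meridian
    d = x / (n1 * K0)

    lat = phi1 - (n1 * tan1 / r1) * (
        d**2 / 2
        - (5 + 3 * t1 + 10 * c1 - 4 * c1**2 - 9 * ep2) * d**4 / 24
        + (61 + 90 * t1 + 298 * c1 + 45 * t1**2 - 252 * ep2 - 3 * c1**2)
        * d**6 / 720)
    lon = lon0 + (
        d
        - (1 + 2 * t1 + c1) * d**3 / 6
        + (5 - 2 * c1 + 28 * t1 - 3 * c1**2 + 8 * ep2 + 24 * t1**2)
        * d**5 / 120) / cos1
    return math.degrees(lat), math.degrees(lon)
```

A point on the central meridian at the equator, `utm_to_latlon(500000.0, 0.0, 18)`, maps to latitude 0 and longitude 75°W. Because the conversion is applied to every coordinate pair, it is one more per-point cost to weigh alongside those discussed in the concluding remarks.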
Abstract. This work investigates the practical issue of mapping existing 
GIS to the OpenGIS standards. We describe the data models used in 
three systems (MGE, ARC/INFO and SPRING) and analyse the problems 
involved when mapping them to OpenGIS. Our conclusion is that 
the OpenGIS standard has not been defined in a formal and unequivocal 
way, and therefore there are ambiguities and competing alternatives for 
mapping existing GIS into the proposed standard. 



1 Introduction 

The issue of interoperability is currently the subject of substantial efforts, both from 
an academic and an industrial perspective. On this issue, academia and industry 
have taken different, if complementary, perspectives: whereas there is a major 
effort in industry towards a consensus-based solution [1], researchers have 
concentrated their efforts on theoretical issues, such as abstract models for semantic 
interoperability [2] [3]. 

^ This work has been supported by FAPESP - Fundação de Amparo à Pesquisa do 
Estado de São Paulo. 

This work tries to bridge the gap between the two approaches, aiming to 
describe and analyse the practical barriers to interoperability. We start from 
the pragmatic consideration that most end-users, who will eventually build 
interoperability frameworks, already have a large geographical database, organised 
around a commercial system and based on a proprietary data model. These 
users will probably soon face a decision regarding the introduction of technology 
that supports the OpenGIS standards, and will probably have a choice 
between different commercial implementations and migration paths from their 
existing environment. Therefore, in our assessment of interoperability in practice, 
we aim to understand issues such as: 

— Is the OpenGIS proposal a truly generic model, able to provide 
semantic equivalents to concepts in existing proprietary data models? 

— What do real-world systems teach us about the problems of semantic interoperability 
and possible limitations of the OpenGIS approach? 

— How effective and easy will the migration from proprietary frameworks 
to environments such as OpenGIS be? 

— What sort of tools would simplify the migration from existing GIS to the 
OpenGIS framework? 

— What lessons can academic research on interoperability learn from 
the interoperability challenges facing today’s technology? 

In order to address these questions, we have examined three existing GIS 
solutions: MGE [4], ARC/INFO [5] and SPRING [6]. We have chosen the first two 
systems because they are representative of existing technology and claim 
a significant proportion of GIS market share. The choice of SPRING is based 
on two reasons: this system has been developed by INPE, being therefore well 
known to the authors, and its data model explicitly supports the abstractions of 
fields and objects. 

This work is divided into three parts. In Section 2, we briefly examine the 
semantic models used by MGE, ARC/INFO and SPRING. In Section 3, we 
describe a possible mapping between these systems and OpenGIS, which could 
be used in real-world migration to OpenGIS. We conclude with Section 4, where 
we consider the theoretical and practical consequences of our findings. 

2 Semantic Models of Existing Systems 

The semantic models of existing systems are a clear demonstration of the barriers 
faced by the interoperability issue in GIS. In the vast majority of cases, these 
semantic models have been derived based on practical considerations, mostly 
related to the data structures used for representing geographical data on a com- 
puter. 

In order to present the data models of existing systems and of OpenGIS, 
we have used Rumbaugh’s OMT diagrams [7], which capture the notions 
of specialisation (“is-a”) and aggregation (“has-a”). In what follows, semantic 
constructs of the different data models are marked in SMALLCAPS. 



2.1 A Generic Reference Model for Geographical Data 

Our working hypothesis for comparing the semantic models of different systems 
is that, given the great differences between them, a generic reference model is 
necessary to establish a common base to which each system’s abstractions can 
be referred. The conversion to OpenGIS, therefore, requires two steps: (a) 
mapping from the system’s abstractions to the reference model; (b) conversion 
from the reference model to OpenGIS. 

We will use an abstract formulation as a reference for comparing the semantic 
model of different systems: the concepts of fields and objects [8]. The field model 
views the geographical reality as a set of spatial distributions over the geographi- 
cal space. Features such as topography, vegetation maps and LANDSAT images 
are modelled as fields. The object model represents the world as a surface occu- 
pied by discrete, identifiable entities, with a geographical location (with possible 
multiple geometric representation) and descriptive attributes. Human-built fea- 
tures, such as roads and buildings, are typically modelled as objects. For a more 
detailed discussion on these issues, the reader should refer to [9] [10] and [11]. 

In what follows, we will consider the following definitions: 

— A geographical field (or geo-field) is defined by a relation $f = [R, V, \lambda]$, where 
$R$ is a geographical region, $V$ a set of attributes and $\lambda : R \rightarrow V$ is a mapping 
between points in $R$ and values in $V$. Examples of geo-fields include thematic 
geo-fields (when $V$ is a finite denumerable set) and numerical geo-fields 
(when $V$ is the set of real values), corresponding respectively to the 
intuitive notions of thematic maps and digital terrain models. 

— Given a set of geographical regions $R_1, \ldots, R_n$ and a set of attributes 
$A_1, \ldots, A_n$ with domains $D(A_1), \ldots, D(A_n)$, a geographical object (or geo-object) 
is defined by a relation $[a_1, \ldots, a_n, S_1, \ldots, S_m]$, where $a_i$ are its descriptive 
attributes ($a_i \in D(A_i)$) and $S_i$ its geographical locations ($S_i \subseteq R_i$). 
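
The two definitions can be mirrored in code. In this minimal sketch, which is not SPRING's implementation, a geo-field pairs a region with a mapping from points to values, and a geo-object pairs descriptive attributes with one or more locations. All class names are hypothetical, and the region is simplified to an axis-aligned rectangle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Generic, List, Tuple, TypeVar

Point = Tuple[float, float]
V = TypeVar("V")


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle standing in for a geographical region."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, p: Point) -> bool:
        return self.xmin <= p[0] <= self.xmax and self.ymin <= p[1] <= self.ymax


@dataclass
class GeoField(Generic[V]):
    """A region plus a mapping from its points to values."""
    region: Region
    mapping: Callable[[Point], V]

    def value_at(self, p: Point) -> V:
        if not self.region.contains(p):
            raise ValueError("point lies outside the field's region")
        return self.mapping(p)


@dataclass
class GeoObject:
    """Descriptive attributes plus any number of geographical locations."""
    attributes: Dict[str, object]
    locations: List[List[Point]] = field(default_factory=list)


region = Region(0.0, 0.0, 10.0, 10.0)
# Thematic geo-field: values drawn from a finite set of classes.
land_cover = GeoField(region, lambda p: "forest" if p[1] >= 5.0 else "urban")
# Numerical geo-field: values are real numbers.
elevation = GeoField(region, lambda p: 100.0 + 2.0 * p[0])
road = GeoObject({"name": "Main St"}, [[(0.0, 1.0), (10.0, 1.0)]])
```

The field is defined everywhere in its region, whereas the object is a discrete entity with its own identity; neither definition commits to a vector or raster representation.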



2.2 The MGE Data Model 

The MGE (“Modular GIS Environment”) data model uses three main concepts: 
CATEGORIES, FEATURE TYPES and FEATURES [4]. A geographic element is represented 
as a FEATURE. FEATURES are instances of FEATURE TYPES, which may, 
in turn, be further grouped into CATEGORIES, as shown in Fig. 1. Each FEATURE 
TYPE is associated with an ATTRIBUTE TABLE. 

Objects in MGE are modelled by a two-level hierarchy. An object (FEATURE) 
is an instance of a FEATURE TYPE, and FEATURE TYPES can be aggregated into 
CATEGORIES. 

MGE does not include an explicit notion of fields. Vector representations 
of thematic maps use the concepts of CATEGORIES and FEATURE TYPES, with 
two alternatives: (a) the thematic map may be considered as part of a single 
FEATURE TYPE (e.g. “Vegetation”), whose values are stored as attributes in the 
ATTRIBUTE TABLE; (b) the thematic map may be considered as a higher-level entity 
(a CATEGORY such as “Land Cover”) and each of its values (thematic classes) 
is modelled as a different FEATURE TYPE (e.g. “Urban Area”, “Forest”, 
“Agriculture”). Raster representations of thematic maps are modelled as regular 
GRIDS and constitute a separate entity from their vector representation. Digital 
terrain models are stored separately as TIN or regular GRID, depending on the 
chosen representation. 

In summary, the MGE data model can be considered to be object-based, with a 
two-level hierarchy. Geo-fields (such as thematic maps or digital terrain models) 
are modelled directly through one of their representations. 




Fig. 1. MGE’s Data Model 



2.3 The ARC/INFO Data Model 

The ARC/INFO data model [5] has four basic abstractions: COVERAGE, GRID, 
TIN and ATTRIBUTE TABLE. A COVERAGE is a vector representation of geographical 
data, associated with an ATTRIBUTE TABLE, which describes the map elements 
(points, arcs or polygons). In this model, the concept of object (or feature) does 
not exist explicitly; objects are implemented as rows of the ATTRIBUTE TABLE, 
which is required to maintain a unique index. 

Thematic maps have two possible representations. Their vector representation 
is mapped to a COVERAGE, where one or more fields in the ATTRIBUTE 
TABLE indicate the attributes associated with each geographical location. The raster 
representation of a thematic map uses an integer GRID, associated with an 
ATTRIBUTE TABLE, which indicates, for each value in the grid, the corresponding 
attributes. Digital terrain models can be mapped either as a “floating-point 
grid” or as a triangular mesh (TIN). 

In summary, the ARC/INFO data model is representation-oriented: instead of 
describing the world in terms of objects and fields, it allows the user to define 
and manipulate geometrical representations. The user is therefore responsible for 
externally defining the abstract entities and for mapping those entities to the 
most appropriate representation. The ARC/INFO data model is shown in Fig. 2. 

Fig. 2. ARC/INFO’s Data Model 

2.4 The SPRING Data Model 

SPRING is a public-domain GIS developed by INPE [6], available on the Internet 
(http://www.dpi.inpe.br/spring), whose data model (shown in Fig. 3) 
is based on the abstractions of fields and objects. SPRING uses the definitions 
of GEO-FIELD and GEO-OBJECT presented in Sect. 2.1, and allows for two 
particular types of GEO-FIELDS: 

— THEMATIC GEO-FIELDS are fields whose mapping function associates geogra- 
phical locations to a finite denumerable set, and corresponds to the intuitive 
notion of a thematic map. Examples are soil, vegetation and land use maps. 

— NUMERICAL GEO-FIELDS are fields whose mapping function associates geo- 
graphical locations to the set of real values, corresponding to the intuitive 
notion of a digital terrain model. 

Moreover, the model distinguishes between these abstract definitions and 
their geometrical representations, since: 

— GEO-FIELDS can be associated simultaneously with vector and raster representations. 
THEMATIC GEO-FIELDS can be represented as vectors (polygon 
maps) or as rasters (integer grids). NUMERICAL GEO-FIELDS can be represented 
as vectors (contour maps, samples or TINs) or in raster format (floating-point 
grids). In other words, the relation between a GEO-FIELD and its 
representation is one of aggregation (“has-a”) and not of specialisation (“is-a”). 

— GEO-OBJECTS can be mapped into different geometrical vector representations, 
with different topologies (polygon maps, networks and point maps). 
For this purpose, SPRING uses the auxiliary concept of GEO-OBJECT MAP, 
as explained below. 

Since most applications in GIS do not deal with isolated elements in space, it 
is convenient to store the graphical representation of a geo-object together with 
those of its neighbours. For example, the parcels of the same city borough are stored and 
analysed together. These requirements lead SPRING to introduce the concept of 
a GEO-OBJECT MAP, which groups together geo-objects for a given cartographic 
projection and geographical region. Therefore, the geometric representations of 
geo-objects are maintained in instances of the class GEO-OBJECT MAP. The relation 
between GEO-OBJECTS and GEO-OBJECT MAP is one of “is-represented-by”. 
Use of the model concepts has enabled the design of a user interface and a 
query and manipulation language for SPRING which allow manipulation of 
geographical data at an abstract level [12]. 




Fig. 3. SPRING’s Data Model 



3 Mapping into the OpenGIS Semantic Model 

3.1 The OpenGIS Semantic Model 

The OpenGIS model [1] is based on an abstract class (FEATURE) which has 
two specialisations: FEATURE WITH GEOMETRY and COVERAGE. The definition 
of FEATURE WITH GEOMETRY is associated with the idea of geo-object (as given 
in Sect. 2.1) and allows for complex geometrical representations to be associated 
with the same feature and for different features to share the same geometrical 
representation. The locational support for each feature is modelled by the concept 
of SPATIAL REFERENCE, whereby GEOMETRY structures (such as lines, points and 
polygons) which describe the geographical locations of the feature are related to 
a reference extent in a given projection. 

In OpenGIS, COVERAGES are metaphors of phenomena over the Earth’s surface, 
whose spatial domain is a c-function, which associates each location with a set 
of attributes. A COVERAGE may be specialised into one of several geometrical 
representations, including: IMAGE, GRID, LINES and TIN. Fig. 4 illustrates the 
OpenGIS semantic model. 
The OpenGIS specification does not provide a formal definition of FEATURE 
WITH GEOMETRY and COVERAGE, stating that these concepts represent “alternative 
ways of representing spatial information” [1]. Therefore, there is no exact 
equivalence in OpenGIS to the abstractions of geo-objects and geo-fields presented 
in Sect. 2.1. This choice of industry-established terms, instead of simplifying 
the migration of existing systems, is likely to cause significant problems in the 
adoption of OpenGIS in real-life situations, as discussed in later sections. 

OpenGIS is an evolving standard and, as of September 1998, the consortium 
had not published a conclusive definition for the idea of FEATURE COLLECTIONS, 
that would allow the expression of complex features, such as feature hierarchies. 

It is important to note that there is a semantic mismatch between the concepts 
of FEATURE WITH GEOMETRY and COVERAGE in OpenGIS. The definition 
of FEATURE WITH GEOMETRY is abstract and generic, and its relation to its geometry 
is one of aggregation (“a feature has many geometric representations”). 
COVERAGES are directly related to their geometric representations, by specialisation 
(“a grid coverage is a coverage”). 
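
The mismatch between aggregation and specialisation can be illustrated with a small class sketch. The class names follow the concepts named in the text, but the code is a hypothetical illustration and not an OpenGIS API.

```python
from typing import Dict, List


class Geometry:
    def __init__(self, kind: str) -> None:
        self.kind = kind  # e.g. "polygon" or "point"


class Feature:
    """Abstract root of the model."""


class FeatureWithGeometry(Feature):
    """Aggregation: a feature *has* any number of geometries."""

    def __init__(self, attributes: Dict[str, object]) -> None:
        self.attributes = attributes
        self.geometries: List[Geometry] = []


class Coverage(Feature):
    """Specialised below: each subtype *is* one representation."""


class GridCoverage(Coverage):
    pass


class TinCoverage(Coverage):
    pass


parcel = FeatureWithGeometry({"owner": "unknown"})
parcel.geometries.append(Geometry("polygon"))
parcel.geometries.append(Geometry("point"))  # same feature, second geometry

terrain_grid = GridCoverage()
terrain_tin = TinCoverage()  # a separate object, not another view of the grid
```

The parcel remains a single feature regardless of how many geometries it holds, while the gridded and triangulated terrain are two distinct coverage objects with nothing in the model to tie them to the same underlying phenomenon.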




Fig. 4. OpenGIS Data Model 



3.2 Mapping Existing Databases into OpenGIS 

The scenario envisaged in this paper is a situation where an institution, which 
has an established geographical database in a proprietary format, would like to 
migrate to the OpenGIS model. In this process, it will need to find approximate 
equivalents in the OpenGIS model to the concepts of its own semantic model. 

One important consideration here is that the mapping should be able to benefit, 
as much as possible, from the features and tools provided by OpenGIS. 
This will require an extra abstraction level which is not present in most semantic 
models of existing systems: that of establishing whether the data represent 
objects or fields. 



3.3 Mapping MGE into OpenGIS 

The mapping of MGE concepts into OpenGIS definitions faces a significant 
issue: the two-level feature hierarchy of the MGE semantic model (CATEGORIES and 
FEATURE TYPES) requires the concept of FEATURE COLLECTIONS in OpenGIS 
to be fully defined; otherwise, a significant part of MGE’s semantic richness will 
be lost in the translation. 

Objects are defined in MGE using the CATEGORY-FEATURE 
TYPE hierarchy, which would require an equivalent in OpenGIS, namely, the 
FEATURE COLLECTIONS-FEATURE WITH GEOMETRY hierarchy. 

The issue is further complicated in the case of thematic maps, where we have an 
ambiguous situation. As discussed in Sect. 2.2, there are two possible ways of mapping 
vector representations of thematic maps in MGE: (a) using the CATEGORY-FEATURE 
TYPE hierarchy; (b) collapsing the CATEGORY-FEATURE TYPE notions 
into a single concept and using the ATTRIBUTE TABLE to store the values of the 
theme associated with each geographical area. When mapping to OpenGIS, 
situation (a) requires the abstraction of FEATURE COLLECTION to be defined in 
OpenGIS for a direct mapping to take place, and situation (b) is best handled 
by using the GEOMETRY COVERAGE definition in OpenGIS. 

Digital terrain models in MGE (represented as TIN or GRID) are mapped in 
a straightforward fashion to OpenGIS TIN COVERAGE and GRID COVERAGE. 



3.4 Mapping ARC/INFO to OpenGIS 

When mapping ARC/INFO to OpenGIS, the user will first have to define whether 
the data being mapped refer to fields or to objects. Since ARC/INFO does not 
provide an explicit way of representing objects (they are implemented as unique 
indexes of the ATTRIBUTE TABLE), such a translation cannot be automatic but 
will require user intervention. 

The issue is further complicated by OpenGIS’s choice of the term 
COVERAGE. Some users will be tempted to automatically map an ARC/INFO 
COVERAGE into an OpenGIS COVERAGE, when in fact these are not exact equivalents. 
In the case of objects, ARC/INFO’s COVERAGES are best mapped into 
OpenGIS using the FEATURE WITH GEOMETRY concept, in order to benefit from 
the query functions defined by OpenGIS (which include topological operators). 

Thematic maps in ARC/INFO are mapped directly to OpenGIS. Vector representations 
of such maps (which are ARC/INFO COVERAGES) can be translated 
to OpenGIS LINE COVERAGES, and raster representations (which are 
ARC/INFO GRIDS) are mapped into OpenGIS GRID COVERAGES. Digital terrain 
models are mapped directly to their OpenGIS equivalents (ARC/INFO’s 
GRID and TIN to OpenGIS’s GRID COVERAGE and TIN COVERAGE). 
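
The need for user intervention can be made explicit in a mapping routine. In this hypothetical sketch, which is not a real migration tool, the user must state whether a dataset represents objects or a field, because the ARC/INFO structure alone does not determine the OpenGIS target. The function name, the `Meaning` enumeration and the string labels are all invented for illustration.

```python
from enum import Enum


class Meaning(Enum):
    """What the data represent; only the user can supply this."""
    OBJECTS = "objects"
    FIELD = "field"


def map_arcinfo_dataset(structure: str, meaning: Meaning) -> str:
    """Return the OpenGIS concept suggested for an ARC/INFO dataset.

    structure is one of "coverage", "grid" or "tin". The meaning
    argument is consulted only for vector coverages in this sketch.
    """
    if structure == "coverage":
        if meaning is Meaning.OBJECTS:
            return "FEATURE WITH GEOMETRY"
        return "LINE COVERAGE"  # vector representation of a thematic map
    if structure == "grid":
        return "GRID COVERAGE"
    if structure == "tin":
        return "TIN COVERAGE"
    raise ValueError(f"unknown ARC/INFO structure: {structure!r}")


parcels = map_arcinfo_dataset("coverage", Meaning.OBJECTS)
soils = map_arcinfo_dataset("coverage", Meaning.FIELD)
```

Two datasets with the same ARC/INFO structure end up as different OpenGIS concepts, which is why a purely name-based translation of COVERAGE to COVERAGE would be misleading.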

3.5 Mapping SPRING to OpenGIS 

Since the SPRING data model is based on the abstractions of fields and objects, 
its mapping into OpenGIS is somewhat simplified. Objects in SPRING 
correspond directly to FEATURES WITH GEOMETRY in OpenGIS. However, the 
support for multiple representations in SPRING (with the idea of a GEO-OBJECT 
MAP) is not fully available in the current OpenGIS specification. In the special 
case when there is one GEO-OBJECT MAP associated with a set of objects, it can 
be mapped to the OpenGIS concepts of GEOMETRY and SPATIAL REFERENCE 
SYSTEM. 

In the case of fields, the situation is more complicated. As discussed earlier, 
the relation of an OpenGIS COVERAGE to its subtypes is one of specialisation. 
In SPRING, THEMATIC and NUMERICAL fields can have multiple representations 
(raster and vector), a notion which is consistent with the abstract definition of a 
field. Therefore, one SPRING THEMATIC field will be mapped to many OpenGIS 
COVERAGES, depending on how many representations are associated with it. As 
a consequence, an important part of SPRING’s semantic expressiveness will be 
lost in the process of translation to OpenGIS. 
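
The one-to-many mapping of a SPRING field can be sketched as follows. All names are hypothetical. The choice of LINE COVERAGE for a vector representation is an assumption borrowed from the ARC/INFO discussion in Sect. 3.4, since the text does not name the target subtype for SPRING.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThematicGeoField:
    """One abstract field that may hold several representations."""
    name: str
    representations: List[str] = field(default_factory=list)


@dataclass
class OpenGISCoverage:
    subtype: str
    source_name: str


# Assumed correspondence between representation and coverage subtype.
SUBTYPE_FOR: Dict[str, str] = {
    "vector": "LINE COVERAGE",
    "raster": "GRID COVERAGE",
}


def map_thematic_field(f: ThematicGeoField) -> List[OpenGISCoverage]:
    """Produce one independent coverage per representation of the field."""
    return [OpenGISCoverage(SUBTYPE_FOR[r], f.name) for r in f.representations]


soils = ThematicGeoField("soils", ["vector", "raster"])
coverages = map_thematic_field(soils)
```

After the translation, the only link between the two resulting coverages is the shared source name; the fact that they were two representations of a single field is no longer expressed by the model.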

4 Conclusion: Practical Challenges to Interoperability 

The main conclusions of the above discussion on the issues of mapping existing 
systems to OpenGIS are: 

— Some systems have semantic models which are richer in content than the 
OpenGIS one (e.g. the CATEGORY-FEATURE TYPE definition in MGE and the 
GEO-FIELD definition in SPRING). 

— The use of industry-established terminology in OpenGIS is a mixed blessing. 
Instead of simplifying the migration process, it may rather be a source of 
misunderstanding (e.g., the mapping of an ARC/INFO COVERAGE to OpenGIS 
FEATURES WITH GEOMETRY). 

In each case examined, there were different mapping alternatives from the 
system to OpenGIS, which indicate that automatic migration to OpenGIS is 
not a recommended option and that a higher level of semantic modelling is 
needed before the actual mapping to OpenGIS takes place. This higher level of 
modelling would be an abstract description in terms of fields and objects (or a 
more sophisticated approach), along the lines of [2]. 

In our opinion, one of the main sources of the problems we have described
is the choice of the OpenGIS consortium to use industry terminology, which
is already content-rich and is associated by users with existing semantic
concepts (feature from MGE and coverage from ARC/INFO). Had OpenGIS
chosen to describe its concepts in more abstract and formal terminology, some
of these problems might have been avoided.

Another problem is the semantic mismatch between the notions of feature 
WITH GEOMETRY and COVERAGE in OpenGIS, the first being an abstract 
definition and the second, directly linked to its geometric representations. This 
could be improved if the OpenGIS definition of coverage were changed to an
abstract one, where its relation to the representations is one of aggregation. 

In conclusion, the analysis we have conducted has led us to believe that
there will be major practical challenges in achieving interoperability, even with
the adoption of the extensive work being pursued by the OpenGIS
consortium. It also calls for theoretical work on deriving rich semantic models
that could provide a general framework into which existing semantic models
could be mapped without loss of content, and that would allow later
conversion to other models.

Abstract. A software testbed was constructed by reusing components
developed for a previous demonstration and by adding new components.
These components were constructed using prototypes of software implementing
early versions of the OpenGIS® Simple Features Specifications^2.
The demonstration was presented at the Open GIS Consortium
booth at GEObit in Leipzig, Germany in May 1998. The original demonstration
from October 1997 and the demonstration of May 1998
are described. Several issues regarding interoperability, data modelling,
data scale, and usability are discussed. Data used in the demonstration
include 1:1,000,000 scale Digital Chart of the World and 1:1,000 scale
cadastral data for the city of Berlin. Technical issues are discussed and
user feedback is presented.



1 Introduction 

1.1 Background and Intention of the OGC/FGDC Demo 

In August 1997, under the sponsorship of the Open GIS Consortium (OGC), a
group of OGC members embarked on an ambitious demonstration [1] [3] (which
has subsequently become known as the FGDC demonstration, being named after
its primary sponsor, the US Federal Geographic Data Committee) of how Open
GIS would benefit the construction of distributed spatial applications. The
decision to attempt the demonstration was made following the acceptance by the
OGC Technical Committee of the OpenGIS® Simple Features Specifications for
SQL, CORBA, and OLE/COM [5]. The specifications were accompanied by
prototype implementations of the interfaces contained in the specifications. These
implementations were to be used in the subsequent October FGDC demonstration.
The driving force behind the FGDC demonstration was a scenario^3 written
by Doug Nebert of the US Federal Geographic Data Committee (FGDC). That
scenario called for use of a geospatial clearinghouse to discover spatial data and
for the ability to instantly view the data sources without first having to
download the data and load them into a local GIS. The demonstration was held
in October 1997 at the GIS/LIS exhibition in Cincinnati, Ohio, in the United
States.

^1 OpenGIS is a trademark or service mark, or registered trademark or service mark,
of the Open GIS Consortium, Inc. in the US and other countries.

^2 The software used in the demonstrations has not been tested for conformance to
the Open GIS Consortium specifications, particularly given the fact that no such
tests have been applied to any software at the date of the demonstrations. The
specifications as they were available at the time were used in the construction of some
of the software in order to provide a practical test of the specifications. Therefore
the authors make no claims about actual conformance to current versions of the
specifications.

A. Vckovski, K.E. Brassel, and H.-J. Schek (Eds.): INTEROP’99, LNCS 1580, pp. 139-149, 1999.
© Springer-Verlag Berlin Heidelberg 1999




Fig. 1. Architecture and setup of the original FGDC demo.



1.2 Brief Description of the Original FGDC Demo Setup 

Given the short amount of time and limited funding available to construct the
demonstration, it was decided to use OpenMap™^4, which has its roots in US
Defense Advanced Research Projects Agency (DARPA) research. OpenMap is a
middleware system which at the time provided most of the functionality needed
to achieve the stated goals. What was missing were the various OGC vendors’
implementations of the Open GIS Simple Features specifications. Four vendors
(Bentley Systems, ESRI, Intergraph, and Oracle) each provided software which
either implemented the nascent specification directly or implemented portions
of the specification. The vendors have subsequently upgraded much of the
software that originally was used as they moved towards Open GIS conformance
testing.

Figure 1 is a full component diagram of the original demonstration. Each
grouping of modules represents an Internet-accessible service capable of providing
graphical representations of the underlying database. OpenMap provided
the integration architecture. Each service group is connected to an OpenMap
software agent known as a Specialist. Using the OpenMap Specialist protocol^5
developed for DARPA, the OpenMap client is able to connect to any combination
of the Specialist/Vendor “stacks”. Further discussion of the architecture is
provided below.

^3 http://www.fgdc.gov/publications/documents/clearinghouse/wwwmap_scenario.html

^4 OpenMap is a United States trademark of BBN Corporation, a unit of GTE.



1.3 Intentions of the GEObit Demo 

GEObit is an international trade fair for spatial information technologies and
geo-informatics in Leipzig, which was organized for the first time in May 1998.
The Open GIS Consortium wanted to use this trade fair as a platform to present
Open GIS technology and to collect feedback from European GIS users, particularly
from Germany and Eastern Europe. At the same time SICAD Geomatics
had started some development efforts to implement the OpenGIS® Simple Features
Specification for SQL. This presented an opportunity to extend the FGDC
demo with a SICAD-based server engine, the SICAD Geospatial Data Server,
and data typically used in German GIS applications.

The choice of the data set was an important point. The FGDC demo employed
small-scale data with little detail. For the GEObit demo it was decided
to use large-scale (1:1000) cadastral data. This kind of data is used not only
for cadastral purposes; it is also the base for most GIS applications in German
municipalities. The particular dataset used for GEObit was supplied by
the city of Berlin, and using this dataset had two advantages. First, this kind of
data was well known to most of the GEObit visitors, so they were able to view
the demo and give qualified feedback. Figure 3 shows this data as presented in
the demonstration. Secondly, this was an ideal test case to find out how
the OpenGIS® Specifications apply to these data and which difficulties would
be encountered. The preparation work for the demo was also a
good chance to gain experience in how to build applications from interoperable
GIS components in a joint project between two companies.



2 System Setup / Architecture 

2.1 OpenMap 

Rather than being a GIS, OpenMap is a geospatial middleware system which
allows rapid construction of information systems requiring a geospatial component.
OpenMap consists of a set of Java components (since the GEObit demonstration,
these components have been converted to Java Beans^6) which assist in
the movement of geospatial information from a data source to the user’s screen.
Inherent in the architecture is the ability to simultaneously connect to multiple
data sources, locally or via the Internet. For the FGDC demonstration and the
GEObit demonstration, the OpenMap connection to the data sources was via
IIOP (Internet Inter-ORB Protocol) [4]. Each OpenMap Specialist (a software
agent running on the remote machine containing the data source) implements the
same CORBA IDL (Common Object Request Broker Architecture Interface
Definition Language). In order to bind a Specialist to a data source, a programmer
can use object inheritance to specialize the Specialist class. In other words, using
either C++ or Java, a programmer can use the code supplied with OpenMap to
build the data access agent for any given data source.

^5 http://javamap.bbn.com/projects/matt/development/specialist.html and
http://javamap.bbn.com/projects/matt/development/source/Specialist.idl

^6 http://www.JavaSoft.com:80/beans/docs/spec.html

At its most basic level, a Specialist must be able to satisfy a single request
from the client. This request is known as “GetRectangle.” GetRectangle is
invoked when the Java client’s view of the world changes. When a user zooms or
pans the map, or changes the projection, the action results in the client code
invoking GetRectangle on the Specialists it is connected to. Parameters of the
GetRectangle request include the current projection, the window parameters,
and the latitudes and longitudes of the map’s corners. Using this information,
a Specialist can execute a spatial query into its database, find all the features
that fall within the given area, and then use the geometries and attributes of
the features to build a list of how the features should look on the map. For
example, if a Specialist connected to a database of roads receives a
GetRectangle request, it can find all the relevant roads, and then decide that the
major roads are to be drawn as 3-pixel thick red lines and the minor roads as
1-pixel thick blue lines. Each feature’s graphics are constructed by the Specialist
and added to a list. When all the features have been found, the list is returned
via the IIOP connection to the client code.
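
The request/response cycle described above can be illustrated with a minimal
sketch. It is written in Python for brevity rather than in the CORBA IDL, C++
or Java used in the demonstration, and it is not taken from the OpenMap
Specialist interface: the names RoadSpecialist, get_rectangle, Road and Graphic,
the "category" attribute and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude) in degrees


@dataclass
class Road:
    name: str
    category: str  # "major" or "minor" (assumed attribute)
    path: List[Coord]


@dataclass
class Graphic:
    path: List[Coord]
    color: str
    width_px: int


class RoadSpecialist:
    """Hypothetical stand-in for a Specialist bound to a road database."""

    def __init__(self, roads: List[Road]):
        self.roads = roads

    def get_rectangle(self, west, south, east, north) -> List[Graphic]:
        graphics = []
        for road in self.roads:
            # Crude spatial filter: keep a road if any vertex lies in the window.
            if any(west <= x <= east and south <= y <= north for x, y in road.path):
                if road.category == "major":
                    graphics.append(Graphic(road.path, "red", 3))
                else:
                    graphics.append(Graphic(road.path, "blue", 1))
        return graphics


roads = [
    Road("A1", "major", [(13.0, 52.0), (13.5, 52.5)]),
    Road("Side St", "minor", [(13.2, 52.2), (13.3, 52.2)]),
    Road("Far Rd", "minor", [(20.0, 40.0), (21.0, 41.0)]),
]
specialist = RoadSpecialist(roads)
result = specialist.get_rectangle(12.5, 51.5, 14.0, 53.0)
for g in result:
    print(g.color, g.width_px, g.path)
```

The client side then only has to iterate over the returned list and draw each
entry, without knowing anything about the data source behind the Specialist.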

Then the client code simply looks at the list and, one by one, draws the 
appropriate graphics on the screen. This works for any kind of data that can be 
turned into a series of graphic objects. In practice, the GetRectangle call also 
includes information that allows a Specialist to receive parameters other than 
just the projection parameters. Query parameters can be handled so that a given 
Specialist can be directed to constrain the search along non-spatial dimensions. 



2.2 SICAD Geospatial Data Server 

The SICAD Geospatial Data Server^7 is a commercial software package, developed
and distributed by SICAD Geomatics, Munich. Its main purpose is the
management and storage of geodata. The SICAD Geospatial Data Server was
first released in 1993. It is widely used, with practically every installation of
SICAD/open, and is now available in its third major release. All major UNIX
operating systems as well as Windows NT are supported.

^7 http://www.sicad.com/technology/

The SICAD Geospatial Data Server is middleware based on standard RDBMS
software from Oracle and Informix. It extends these RDBMS by geometric and
graphic datatypes. Spatial indexing is done by use of a quadtree. For efficient
storage and retrieval the geometric data is stored in binary format, using either
the BLOB datatype in Informix or RAW in Oracle. All geometric data belonging
to one cell of the quadtree is stored within one datum of type BLOB or RAW.
This results in a spatial clustering of the data.
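
The idea of clustering by quadtree cell can be sketched as follows. The text
does not describe the keying scheme actually used by the SICAD Geospatial Data
Server, so the quadrant numbering, the fixed depth and the function name
quadtree_key below are assumptions made purely for illustration.

```python
from collections import defaultdict


def quadtree_key(x, y, extent, depth):
    """Return the quadrant path (e.g. '03') of the cell containing (x, y)."""
    xmin, ymin, xmax, ymax = extent
    key = ""
    for _ in range(depth):
        xmid = (xmin + xmax) / 2.0
        ymid = (ymin + ymax) / 2.0
        quadrant = 0
        if x >= xmid:       # east half
            quadrant += 1
            xmin = xmid
        else:
            xmax = xmid
        if y >= ymid:       # north half
            quadrant += 2
            ymin = ymid
        else:
            ymax = ymid
        key += str(quadrant)
    return key


extent = (0.0, 0.0, 100.0, 100.0)
points = {"p1": (10.0, 10.0), "p2": (12.0, 14.0), "p3": (90.0, 80.0)}

# Group geometries by cell; in a database each group would be serialized
# into a single binary value (one BLOB or RAW datum per cell).
cells = defaultdict(list)
for name, (x, y) in points.items():
    cells[quadtree_key(x, y, extent, depth=2)].append(name)

for key, names in sorted(cells.items()):
    print(key, names)
```

Neighbouring geometries such as p1 and p2 end up under the same key, so a
window query needs to read only the few cells that intersect the window.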

The geographic objects in the SICAD Geospatial Data Server have a topo- 
logic structure in which common geometric primitives, e.g. a line shared by two 
adjacent areas, are stored only once. This means that geometries are stored with- 
out any redundancies. Together with the topologic relationships inherent to these 
data structures this data model is especially powerful for utility and cadastral 
applications. These applications are typically built to maintain large volumes of 
large-scale geographic data with frequent updates in a multi-user environment, 
and require high quality and accuracy of the data. Attribute information can 
be stored directly with the geometric data or in separate database tables. In 
the latter case, linkage information is maintained through the Geospatial Data 
Server. 

Besides the basic query, update and retrieval functionality, the Geospatial 
Data Server provides functions to work in distributed environments typical of 
large utilities or municipalities. This includes accessing and overlaying the con- 
tent of several databases, a central dictionary for distributed databases and 
advanced replication functions for differential update of remote databases. 



2.3 Demonstration Architecture 

The GEObit demonstration was implemented as a set of Specialists. These
Specialists, subclassing the Specialist CORBA IDL, were compiled into C++ classes
for Orbix (a CORBA implementation from IONA Technologies, Ireland). Geodata
management functions were provided using a prototype implementation of the
OpenGIS® Simple Features Specification for SQL based on the SICAD Geospatial
Data Server. Internet communication of the OpenMap client with Specialists
was established by means of CORBA implementations deploying IIOP. On the
Windows NT client, Visibroker 3.2 for Java (a CORBA implementation from
Inprise, formerly Borland) was used to compile and integrate the Specialist IDL
into the OpenMap client. On the Sun Solaris 2.6 server, Orbix 2.3 for C++
provided the ORB (Object Request Broker). With this system setup the suitability
of CORBA technologies for inter-platform, inter-ORB and inter-language
process communication could be tested. In the GEObit demonstration, access of
an OpenMap client to a variety of geodata sets managed with different systems
was presented. Geodata sets were

• of different types (e.g. raster data of elevation model, vector data of point, 
line and polygon features), 

• of different scale (e.g. small scale political boundaries, large scale cadastral 
data) and 

• of different information communities (e.g. cadastral surveying, utilities).

Fig. 2. Architecture and setup of the GEObit demo.



Therefore, besides the DTED and DCW Specialists which were already components
of the FGDC demo, SICAD Geospatial Data Server Specialists for accessing
large-scale cadastral and utility data for the city of Berlin were added
(see Figure 2). When OpenMap was set to a scale larger than an appropriate
minimum scale, these new Specialists generated simplified graphics from existing
geodatabases, which were used for all GEObit demonstrations of SICAD
products.

The graphic objects, after being received and presented in the OpenMap 
client, allowed direct interactions with the respective Specialist objects via proxy 
objects in OpenMap. These interactions included inquiry and presentation of 
alphanumeric data for a selected object whose graphical appearance also was 
altered in OpenMap. 



3 Semantic Issues 

3.1 Geometry Model 

The OpenGIS® geometry model, which is a basis for Simple Features for SQL 
implementations, doesn’t define topological relationships between geometric pri- 
mitives. This leads to redundancy of coordinates that exist in the geometry of 
more than one feature (e.g. border between two adjacent land parcels) since 
geometric primitives can’t be shared by features. 

As explained above, the SICAD Geospatial Data Server implements a topologic
geometry model that controls redundancy. Because capturing, updating
and storing topologic structures of geographic objects is efficiently supported by
the SICAD geometry model, conversions were needed in the implementation of
Simple Features for SQL.

Geometry conversions for generating Well Known Structures for OpenGIS®- 
Geometry could be achieved simply by navigating through topological structures 
gathering and ordering the coordinates of visited geometric primitives. But, since 
shared geometric primitives are visited multiple times in the course of generating 
Well Known Structures for the respective features, redundancy is introduced in 
the process of conversion. Therefore, it is very difficult to reproduce the topolo- 
gical structures by converting back from the generated Well Known Structures. 
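
A small sketch may make the source of this redundancy concrete. The edge and
parcel tables below are invented for illustration and do not reflect the actual
SICAD data structures; the output uses a text form in the style of the
well-known text representation of polygons.

```python
# Topological store: each edge (a sequence of coordinates) is stored once.
edges = {
    "e1": [(0, 0), (1, 0)],
    "e2": [(1, 0), (1, 1)],  # shared border between parcels A and B
    "e3": [(1, 1), (0, 1)],
    "e4": [(0, 1), (0, 0)],
    "e5": [(1, 0), (2, 0)],
    "e6": [(2, 0), (2, 1)],
    "e7": [(2, 1), (1, 1)],
}

# Each parcel lists its boundary edges in order; "-" means reverse traversal.
parcels = {
    "A": ["e1", "e2", "e3", "e4"],
    "B": ["e5", "e6", "e7", "-e2"],
}


def to_ring(edge_refs):
    """Walk the referenced edges and emit one closed coordinate ring."""
    ring = []
    for ref in edge_refs:
        coords = edges[ref.lstrip("-")]
        if ref.startswith("-"):
            coords = coords[::-1]
        # After the first edge, skip the first vertex of each edge because
        # it repeats the last vertex of the previous edge.
        ring.extend(coords if not ring else coords[1:])
    return ring


def to_wkt(ring):
    return "POLYGON((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"


rings = {name: to_ring(refs) for name, refs in parcels.items()}
for name, ring in rings.items():
    print(name, to_wkt(ring))
```

Edge e2 is stored once in the topological model, but its coordinates appear in
both generated polygons. Nothing in the output records that the two copies
were ever the same primitive, which is why the reverse conversion is hard.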

This is especially true for the SICAD Geospatial Data Server, which allows
for controlled redundancy of geometric primitives. The Simple Features for SQL
implementation is suitable for generating graphics for GIS viewers like OpenMap,
but not for the exchange of geodata with GIS that use topological geometry.

3.2 Definition of Geoobjects / Features 

When working with geodata sets of different scales produced for different
information communities, one finds that definitions of features as classes of
geoobjects are hardly comparable. This affects not only definitions of feature
attributes, but also their borders and geometry types, because they differ
significantly. For example, roads in a cadastral application are defined by land
ownership and are associated with Polygon geometries. This makes usage of
existing cadastral road features (e.g. by transportation applications) difficult,
since road features in transportation applications are probably continuous
between two nodes in a road network and associated with LineString geometries.
Interoperability of these applications could be achieved, for example, by
converting cadastral Polygon geometries to LineString geometries and assembling
them in a MultiLineString to generate a transportation road feature from the
cadastral road parcels. The prerequisite for such interoperability is a feature
attribute in the cadastral application that can be used to select the road
features which form a respective road feature in the transportation application.
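
The conversion outlined above might look as follows in a deliberately
simplified form. The sketch assumes that every road parcel is a quadrilateral
and approximates it by the line joining the midpoints of its two shortest
sides; this rule, the "road_name" attribute and the sample parcels are all
hypothetical and would not be adequate for real cadastral data.

```python
from math import hypot


def crude_axis(ring):
    """Approximate a quadrilateral road parcel by the line joining the
    midpoints of its two shortest sides (an assumed, simplistic rule)."""
    sides = list(zip(ring[:-1], ring[1:]))
    sides.sort(key=lambda s: hypot(s[1][0] - s[0][0], s[1][1] - s[0][1]))
    (a, b), (c, d) = sides[0], sides[1]
    return [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2),
            ((c[0] + d[0]) / 2, (c[1] + d[1]) / 2)]


def to_multilinestring(parcels, road_name):
    """Select parcels by the linking attribute and assemble their axes."""
    lines = [crude_axis(p["ring"]) for p in parcels if p["road_name"] == road_name]
    parts = ["(" + ", ".join(f"{x:g} {y:g}" for x, y in line) + ")" for line in lines]
    return "MULTILINESTRING(" + ", ".join(parts) + ")"


parcels = [
    {"road_name": "Hauptstrasse",
     "ring": [(0, 0), (10, 0), (10, 2), (0, 2), (0, 0)]},
    {"road_name": "Hauptstrasse",
     "ring": [(10, 0), (20, 0), (20, 2), (10, 2), (10, 0)]},
    {"road_name": "Nebenweg",
     "ring": [(0, 2), (2, 2), (2, 12), (0, 12), (0, 2)]},
]

road = to_multilinestring(parcels, "Hauptstrasse")
print(road)
```

Without an attribute such as the assumed "road_name", there would be no way to
decide which parcels belong to the same transportation road feature.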

In the GEObit demonstration the DCW (Digital Chart of the World, scale
1:1,000,000) Specialist generated a small-scale OpenMap layer with the network
of major roads, instead of using a SICAD Geospatial Data Server Specialist
to generate a road network by converting cadastral geodata. When these
OpenMap layers were overlaid, the limitations of cross-application analysis
(e.g. finding the owners of land parcels adjacent to a major road) became
obvious, since the accuracy of the underlying geodata sets differed significantly.

3.3 ALK / ATKIS (German National Standards) vs. Simple 
Features 

The Open GIS Consortium aims to provide a common object model that facilitates
interoperability of GIS products and applications. On the German national
level a standardization initiative was begun by the Association of Authoritative
Surveying Administrations (AdV) many years ago. The latter initiative has
already led to the production of cadastral (ALK) and topographic (ATKIS) base
geodata of high accuracy covering most of Germany. The underlying well-defined
object model of hierarchically structured geoobjects associated with geometric
primitives in a topologic geometry model was implemented by a variety of
commercial GIS and database applications. Between them, geoobjects can be
exchanged, and log files generated in the course of modifying geoobjects for
correction with one GIS can be used to update replicated geodatabases managed
with another GIS or database application. In the revision of the object model
(ALKIS), functional interfaces are also introduced as methods of the geoobjects.

Fig. 3. Screen capture of demonstrated Berlin data.

In the GEObit demonstration, existing cadastral base data, managed with the
SICAD Geospatial Data Server using the ALK object model, were converted to
OpenMap graphic objects in a Specialist based on a prototype implementation of
Simple Features for SQL. The conversion between the different geometry models
was already explained above. Additionally, existing arc and spline geometries had
to be approximated by LineString geometries since they aren’t defined in the
OpenGIS® geometry model yet.
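
Approximating an arc by a LineString amounts to sampling points along the
arc. The sketch below shows one common way of doing this for circular arcs; it
is not the procedure used in the demonstration, which the text does not
describe, and the function names and the 10-degree step are assumptions. The
largest gap between a chord and the true arc is r(1 - cos(step/2)).

```python
import math


def arc_to_linestring(cx, cy, radius, start_deg, end_deg, max_step_deg=10.0):
    """Approximate a circular arc by vertices spaced at most max_step_deg apart."""
    sweep = end_deg - start_deg
    n = max(1, math.ceil(abs(sweep) / max_step_deg))
    points = []
    for i in range(n + 1):
        a = math.radians(start_deg + sweep * i / n)
        points.append((cx + radius * math.cos(a), cy + radius * math.sin(a)))
    return points


def max_deviation(radius, step_deg):
    """Largest distance between a chord and the arc it replaces."""
    return radius * (1.0 - math.cos(math.radians(step_deg) / 2.0))


line = arc_to_linestring(0.0, 0.0, 100.0, 0.0, 90.0)
deviation = max_deviation(100.0, 10.0)
print(len(line), "vertices, maximum deviation", round(deviation, 3))
```

A smaller step gives a closer fit at the cost of more vertices, so the step
would in practice be chosen from the accuracy required at the display scale.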

Conversion of object structures was only possible by omitting some of the
associated data, since complex textual annotations and geometric primitives used
for portrayal still have to be defined by future OGC specifications extending the
Simple Features specifications. Therefore, the graphical appearance of a cadastral
layer in the OpenMap client didn’t meet the requirements imposed on cadastral
GIS applications in Germany.

3.4 Coordinate Transformation

Since the OpenMap client requires that coordinates of graphical objects created
by a Specialist be provided in the WGS84 coordinate system, the coordinate
transformation implemented in the SICAD Geospatial Data Server was applied to
existing geodata stored in the Soldner Berlin New coordinate system.

4 Technical Issues 

4.1 Operating Systems 

Operating system differences were not an issue for this demonstration,
because of the Java implementation of OpenMap and the use of CORBA
implementations to provide IIOP-based object communication.

4.2 CORBA ORB 

The CORBA IIOP connection between OpenMap and the Specialist was made
up of products from two different vendors. The OpenMap Specialist layer is based
on the OpenMap Specialist IDL. Inprise Visibroker 3.2 for Java was used on
the client side (OpenMap) of the interface: its IDL compiler created the client
interface code stubs from the Specialist IDL, and the Visibroker CORBA
implementation jar (Java Archive) files were used at runtime. Iona’s Orbix,
version 3.2, was used to create the server-side (Specialist) C++ code stubs.
The Orbix CORBA shared libraries, combined with OpenMap Specialist support
software, were used to create the C++ code for the Specialist, running on a Sun
Microsystems Solaris 2.6 machine.

The main technical issue with CORBA is, of course, to get the client and
server communicating with each other. The Specialists wrote their Interoperable
Object Reference (IOR) into a specific file whenever they were launched, and
they wrote that file at a predetermined location accessible via an HTTP server.
When the OpenMap client was started, it used the IOR information to contact
the Orbix ORB. The Orbix ORB is a daemon process responsible for setting
up communication between the client and the server. While the Specialists were
set to use a dedicated port number for their communication, the ORB was still
needed to handle the initial contact between the client and the Specialists’
graphical objects. The ORB, after receiving the request from the client, performed
the handshaking required to get the client communicating with the respective
Specialist server.

The only pressing technical problem was the repeated crashing of the Orbix 
ORB. For the GetRectangle calls to the Specialist, where the OpenMap client 
requested graphical objects to fill the screen for the geographic scale and location 
of the OpenMap window, the ORB was stable and performed as expected. The 
problems occurred when OpenMap was in the gesture mode. In gesture mode, 
the ORB was responsible for handling the message passing required to update 
the OpenMap display with identifying strings as the user moved the mouse over
the objects on the screen. It was during this handshaking procedure that the
Orbix ORB kept crashing, preventing the initiation of more message passing or 
GetRectangle responses. The ORB was crashing due to some internal memory 
management problems that could not be reliably reproduced at other physical 
locations. 

The solution for this problem was to launch the ORB and have it use a log 
file to keep track of its state. The ORB has the capability to be started with a 
log file, causing it to be set to a certain state, aware of server objects that it can 
be responsible for. The ORB run command was inserted into an infinite loop, 
which restarted the ORB after every crash in its pre-crash state. The effect on 
the OpenMap performance was negligible. While the solution may appear to be 
distasteful, it was the path recommended by Iona on their website, to make their 
ORB more stable and persistent. 
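
The restart loop can be sketched as follows. The actual Orbix daemon command
and its log file option are not given in the text, so the daemon is simulated
here by a function; supervise, fake_daemon and the file name orb_state.log are
hypothetical. In a real setup the launch function could wrap a call that starts
the vendor's daemon process and waits for it to exit.

```python
def supervise(launch, state_file, max_restarts):
    """Start the daemon and restart it after every abnormal exit.

    Returns the number of times the daemon was started. The loop in the
    demonstration was infinite; max_restarts only keeps this sketch finite.
    """
    starts = 0
    while starts <= max_restarts:
        starts += 1
        # launch() blocks until the daemon exits and returns its exit code.
        # Passing the same state file each time lets the daemon resume in
        # its pre-crash state.
        exit_code = launch(state_file)
        if exit_code == 0:
            break  # clean shutdown: do not restart
    return starts


# Simulated daemon: "crashes" twice, then shuts down cleanly.
exit_codes = iter([1, 1, 0])


def fake_daemon(state_file):
    return next(exit_codes)


starts = supervise(fake_daemon, "orb_state.log", max_restarts=10)
print("daemon was started", starts, "times")
```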

4.3 Internet Process Communication through Firewalls 

In the case where OpenMap was being run as an applet in a web browser, the
Visibroker Gatekeeper was required in order to bypass the inherent security
restrictions imposed upon applets. Since an applet is only permitted to
communicate with the computer it was loaded from, the Gatekeeper server was loaded
on the applet-serving computer and acted as a conduit for IIOP messages passing
between the applet, ORB and Specialist.

The CORBA 3.0 Firewall specification addresses this problem by defining
interfaces for passing IIOP through a firewall. It includes options for allowing the
firewall to perform filtering and proxying on either side. This is very important
for extending the secure use of CORBA to the Internet and across organizational
boundaries.



5 User Feedback 

Most interest and feedback came from representatives of larger German
municipalities. These city administrations typically use GIS software in a broad
range of applications and departments. Many of them use GIS software
from different vendors and have difficulty sharing any geodata between
departments due to the proprietary character of the software used. One representative
of an East German city reported that six different GIS were in use in the city
administration.

The main question was whether these governmental users would accept the
simplification of official data which was necessary in order to use the OpenGIS®
Simple Features Specification. This concerned mainly the graphical representation
of data (which is not defined at all in the mentioned specification) and the
missing map annotation (which is also not defined). It should be noted here that
German authorities use very sophisticated cartographic styles, especially in city
planning applications. But without exception, all respondents said that they
would tolerate these deficiencies if they could gain direct access to
heterogeneous sources of geodata.

The other very positive feedback concerned the fact that the demonstration
allowed feature properties to be queried directly from the server, avoiding the
overhead of transferring them at load time. It seems that more and more
users consider a GIS also as just another user interface to access arbitrary data.
Keeping database connections open and allowing the user to drill down by queries
or navigation seems to be one of the major advantages of the OpenMap approach.

6 Summary 

The importance of practical testing of specifications and ideas cannot be
overemphasized. In the course of constructing the two demonstrations and in their
operation at the respective locations in Cincinnati and Leipzig, the authors have
learned many lessons. As discussed in the preceding sections, there were issues
regarding software interoperability and quality, data modelling differences, data
visualization, and data scale differences. These kinds of issues were not
necessarily unanticipated, but were nonetheless hard to conceptualize without
actually trying things in an operational setting. Furthermore, it is much easier to
revisit the interoperability specifications under development in the Open GIS
Consortium once these lessons have been learned and recorded. An additional
benefit of this kind of testbed is the opportunity to show previously uninitiated
people what the benefits of the new technology can be. While the end-user
community can read about Open GIS and interoperability, one cannot impart the
full message about Open GIS without actually demonstrating it. The feedback
from users can also influence the relative importance which is given to various
parts of the specifications as revisions are made or new ones are developed.

Abstract. Digitally capturing and maintaining geospatial data is expensive.
The consequence is that there is a need for the integration of
geoprocessing software into mainstream computer technology and a demand
to share geospatial data. This demand consequently includes the
sharing of specific graphic views on it. In this paper an implementable
framework is presented for defining graphic views on geospatial data.
This is based on an existing geospatial data definition language and
transfer format called INTERLIS.



1 Introduction 

People have always looked at geographic or geospatial data through maps, which
are usually presented on paper media. With the fast progress of hardware
technology and the advent of the World Wide Web, geospatial data is more and
more also displayed on colored high-resolution screens. But digitally capturing
and maintaining geospatial data is expensive, which is also the case for the
specialized computer systems that are still in use today. The consequence is that
there is a need for the integration of geoprocessing software into mainstream
computer technology and that there is a great demand to share geospatial data
for widespread use [4].

The same applies to the production of paper and screen maps that are up-to-date
and easy to reproduce, preferably with different geographic information
systems (GIS). Therefore it is stated that the need for shared geospatial data
consequently leads to deriving and sharing specific graphic views on it.



1.1 Federated and Interoperable Geographic Information Systems 

Sharing geospatial data implies that heterogeneous GIS are “open” to a certain
degree and are able to exchange data freely [8]. This can be called a
data-centered approach because explicit description is especially important, as are
rules that allow for the encoding of objects to a corresponding data access service
[15]. The access, query and manipulation of geospatial data is done through a
data manipulation language or through an application programming interface
(API). This is called a process-centered approach. In realistic environments both
approaches are required for “interoperability”, that is, “the capability for two or
more computers to transmit data and to carry out processes as expected by the
users” [16].

Because of the diversity of how GIS implemented geoprocessing [1] [18] and 
because data lives about ten times longer than software [20] we focus our stan- 
dardization efforts on the data-centered approach. 

It is assumed that the current geospatial infrastructure consists of federated
networked systems, which means that basically independent, heterogeneous
systems are committed to follow common rules [15]. A user or an information
community sometimes wants to act as a data or service provider and sometimes
as a data or service consumer. Jones [14] wrote in a similar sense that
“Multi-participant, cooperative GIS means that governments and agencies at the
local, state, and federal levels and the private sector become partners in
establishing and maintaining a GIS. Naturally, each ‘partner’ may participate in a
different way”.



1.2 Decomposing Geospatial Objects 

Abstraction and decomposition are common techniques in information technology.
This separation can also be applied to geographic information [18], and its
consistent application is one of the prerequisites for the integration of geospatial
data structures and geoprocesses into mainstream information systems.

A geospatial object can be decomposed into different aspects [15]:

1. Geometric, georeferenced attribute types (point, polyline, surface, area, so- 
lid) 

2. Textual attribute types (string, number, date, time, etc.) 

3. Other special types like a relationship attribute type 

4. Possibly a reference to one or many graphic presentation definitions 

5. Generic methods - like create, update or delete - and more complex ones 

At this conceptual level geometric attribute types, for example, are treated
like any other (complex) attribute type. This also means that one object can
have zero, one or many geometric attributes. Note that, from a geospatial data
modeling view, a text label that denotes a certain attribute of a geospatial
object is not part of the object: a text label belongs to the presentation model
rather than to the geospatial data model.
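
The decomposition listed above can be sketched in code. The following Python fragment is only an illustration under assumed names: `GeoObject`, `Point2D` and `presentation_refs` are hypothetical and not part of INTERLIS. It shows geometric attributes being stored like any other attribute, so that an object may carry zero, one or many of them, while no label text appears in the data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Point2D:
    """Hypothetical geometric, georeferenced attribute type (aspect 1)."""
    x: float
    y: float


@dataclass
class GeoObject:
    """Hypothetical container for one geospatial object."""
    # Aspects 1-3: geometric, textual and other attributes share one mapping.
    attributes: Dict[str, Any] = field(default_factory=dict)
    # Aspect 4: optional references to graphic presentation definitions.
    presentation_refs: List[str] = field(default_factory=list)

    def geometric_attributes(self) -> Dict[str, Point2D]:
        """Geometry is recognised by its type, not by a special slot."""
        return {k: v for k, v in self.attributes.items()
                if isinstance(v, Point2D)}

    def update(self, **changes: Any) -> None:
        """Aspect 5: a generic method."""
        self.attributes.update(changes)


# An object with two geometric attributes and one without any geometry.
point = GeoObject(
    attributes={"Number": "4711", "NumPos": Point2D(10.0, 12.5),
                "Geometry": Point2D(10.0, 10.0), "Type": "GP1"},
    presentation_refs=["SomeGraphicModel"],
)
owner = GeoObject(attributes={"Name": "Example"})
```

Because the label position (`NumPos`) is just another attribute, the label text itself can be produced later by the presentation model.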



1.3 Modeling Geospatial Objects 

Openness of systems and common rules for data management are established 
through standardization. A well-defined geospatial data definition language - 
sometimes also referred to as a conceptual schema description language - that
is based on predefined data types (as mentioned above) is like a “lingua franca”
between different specialists and at the same time it remains computer readable. 

With such a standardized language any user community has a commonly
understandable and flexible tool for modeling geographic information. The result
of this activity is a conceptual “user application schema”. A conceptual schema
is a crucial concept in database modeling which abstracts from system-specific
implementation issues and establishes a data structure common to the involved
systems [6] [10] [4].










Fig. 1. The presentation model of INTERLIS: this describes graphic presentation as a 
multi-phase process which consists of a selection step and a mapping step which relates 
queried geospatial objects to graphic symbols. 



1.4 The Graphic Presentation Process 

Graphic Primitives. Geospatial data is visualized on paper maps or
screens by means of elementary, generic graphic objects called graphic primitives.
In the 2D case there are the following graphic primitives:

— Text: Text primitives have properties like fixed text, size, font, color, etc.

— Point: Point (or icon) primitives have a size and consist of a combination of
lines and surfaces.

— Line: Line primitives have properties like width, dash style, patterning style,
etc.

— Surface: Surface primitives may be filled, outlined, hatched, etc.

These graphic primitives are also objects that can be modeled with a geospatial
data definition language.



Symbology Library. A symbology library contains text, line, surface and point
symbol object instances. Symbology library symbols that are defined and referred
to by graphic primitives in an application-specific graphic model (or schema)
determine the final rendering on a graphic output device.



Selection, Mapping and Rendering. The present research interest in the
domain of graphic presentation of geospatial data seems to be rather low [13]
[7] [9]. In [5] a language is proposed which consists of two components: a query
language to describe what information to retrieve and a presentation language
to specify how to display query results.

“The OpenGIS Abstract Specification” is a current extension of [4] which
includes a topic volume called “Essential Model of Interactive Portrayal”. The
definition of portrayal is very similar to what is called here a “graphic
presentation”. In that specification the portrayal process is decomposed into the
subprocesses Selection (including query constraints), Generating Display Elements
(dealing with style), Rendering (dealing with image constraints) and Displaying
(taking device characteristics into account).

Based on our research and implementation experiences it can be generally 
stated that the process of graphic presentation of geospatial data consists of the 
following steps [1]: 

1. What is to be presented? The relevant object classes are selected. The
information needs to be communicated exactly according to the intended message.

2. How are the resulting objects represented? The resulting objects of step
one are mapped to graphic symbols (symbolized) such that the non-graphic
information is “transcribed” to the graphic primitives in an unambiguous
way [2].

3. How are the graphic symbols rendered and displayed on a graphic output
device?

Step one can be resolved with a spatial and thematic query statement, possibly
complemented with a filter (cf. [13]). Step two defines how graphic symbols
are generated and involves all problems of visualization, generalization and
graphic layout. Step three is a combination of rendering and displaying. Steps one
and two are enough to describe any symbol explicitly; step three is device
dependent and therefore not subject to data-centered standardization.
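
The three steps can be pictured as a small pipeline. The Python sketch below is an illustration only; the function names (`select`, `map_to_symbols`, `render`), the dictionary layout and the symbol name are assumptions made for this example and are not taken from INTERLIS or any GIS product. It shows that steps one and two are pure data transformations, while step three is the only device-dependent part.

```python
from typing import Callable, Dict, Iterable, Iterator, List

GeoObject = Dict[str, object]   # hypothetical: attribute name -> value
Primitive = Dict[str, object]   # hypothetical: graphic primitive properties


def select(objects: Iterable[GeoObject],
           query: Callable[[GeoObject], bool]) -> Iterator[GeoObject]:
    """Step 1: decide what is to be presented (thematic or spatial query)."""
    return (obj for obj in objects if query(obj))


def map_to_symbols(objects: Iterable[GeoObject],
                   symbol_for: Callable[[GeoObject], str]) -> Iterator[Primitive]:
    """Step 2: relate each selected object to a named graphic symbol."""
    for obj in objects:
        yield {"symbol": symbol_for(obj), "geometry": obj["geometry"]}


def render(primitives: Iterable[Primitive]) -> List[str]:
    """Step 3: device-dependent output; text lines stand in for a device."""
    return [f"draw {p['symbol']} at {p['geometry']}" for p in primitives]


data = [
    {"class": "road", "geometry": (0, 0)},
    {"class": "geodetic_point", "geometry": (5, 7)},
]
selected = select(data, lambda o: o["class"] == "geodetic_point")
primitives = map_to_symbols(selected, lambda o: "point_symbol")
output = render(primitives)
```

Only `render` would have to be replaced to target another output device; the selection and mapping definitions could be shared unchanged.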

This approach is the key for sharing not only the geospatial core data and its 
related schema but also the conceptual definition of specific graphic views among
different systems. Therefore, one can define the graphic presentation process of
geospatial data as follows:

Graphic presentation is a multi-phase process that consists of a selection step
and a mapping step which relates queried geospatial objects to graphic symbols;
they in turn are rendered and displayed on system-specific graphic output devices.

In the next section it is shown that abstraction and decomposition can also
be applied to the graphic presentation of geospatial data and how this was
implemented in INTERLIS.

2 The Approach of INTERLIS 

Faced with federated GIS in the Swiss government, the Directorate of Cadastral
Surveying initiated a standardization process in the 1980s with a small project
team. This team consists of technical representatives from industry, universities
and government. The state of the art in information technology - specifically
database theory - was analyzed. Then, instead of compiling an ultimate
all-purpose feature or object catalogue (see for example [17]), the project team
first specified a textual, geospatial data definition language called INTERLIS
[11].

2.1 A Data Definition Language and a Transfer Mechanism 

In INTERLIS encoding rules are attached to the geospatial data definition
language, which specify, for example, the encoding into a file format. This makes
it possible to communicate not only the description of the data structure, as a
mandatory part of the metadata, but also the object instances without worrying,
e.g., about object codes. In other words, the georeferenced objects describe their
own structure. This information can be accessed before the data is actually
transferred. Ultimately this enables so-called “plug-and-play” data sets. See [11]
for a complete implementation specification of such a mechanism.

2.2 The Mapping Language and the Presentation Model 

To describe and share graphic primitives with their properties, a standard
presentation model is introduced, which is itself defined in an INTERLIS schema.

Specific graphic primitives of a graphic presentation have a well-known
relation to geospatial objects. Therefore a graphic presentation can be seen as a
functional mapping of data model objects to presentation model objects:

GraphicPresentation = Function(GeospatialDataModel) . (1)

The graphic primitives generated by the graphic presentation function
according to Equation 1 exist only temporarily. There is no need to store them in a
database because they can always be generated from geospatial data objects.

In typical computer-aided drawing systems it could be argued, for example,
that there exists no explicit data model; it seems that there are only graphic
primitives stored directly in the system. But in fact there is an internal fixed
data model from which graphic objects are generated by a system-dependent
mapping function at run time.

Finally, every graphic presentation has to be created on some existing output
device. Here the established graphic standards come into play; this step is
therefore considered to be outside the scope of this new standardized framework.

2.3 Summary of the Process 

This is a summary of the whole process (see Fig. 1 and [12]): 

— First a selection of relevant object classes is performed. 

— Through the mapping function (1) the geospatial objects are mapped to gra- 
phic primitives. For the specification of this a selection part of the geospatial 
data definition language is needed as well as an interface to symbols being 
application specific instances of graphic primitives. 

— The graphic interface definition (consisting of specific procedure calls)
describes the graphic primitives text, line, surface and point symbol. The graphic
primitives exist only temporarily whereas the referenced symbols are
persistently stored and maintained in a system-specific manner.

— The presentation model consists of a graphic interface (i.e. an abstract
interface definition) and a symbology library. The presentation model is defined
in INTERLIS and will be called ILI_PRESENTATION or SYMBOLOGY.

— The symbology library contains text, line, surface and point symbol definiti- 
ons. The symbology library determines the rendering of a graphic primitive 
to an output device. 

— Graphic primitives are rendered and displayed by a driver program to a 
graphic output format or output device. 



2.4 An Example 

This example is taken from an existing cadastral surveying application schema
and shows geodetic points that have to be presented in two scale ranges: scale
numbers lower than or equal to 500 (1/500) and above 500 (e.g. 1/10,000). In the
scale range up to 1/500, a text and a point symbol are displayed for each geodetic
point. Different symbols should be used for different kinds of geodetic points
(i.e. GP1 or GP2). At scale 1/10,000 only the GP1 symbol is specified; no GP2
symbols are displayed and there is no accompanying text specification.

MODEL BaseModelMO =

  TOPIC Geodetic_Points =

    TABLE Geodetic_Point =
      Number: TEXT*12;
      NumPos: Point2D;
      NumOri: OPTIONAL LabelOri;       !! Default: 100.0
      NumHAli: OPTIONAL HALIGNMENT;    !! Default: Center
      NumVAli: OPTIONAL VALIGNMENT;    !! Default: Half
      SymbolOri: OPTIONAL LabelOri;    !! Default: 0.0
      Geometry: Point3D;
      !! etc.
      Type: (GP1, GP2);
      Lineage: OPTIONAL TEXT*30;
    CONSTRAINT
      UNIQUE Number;
      UNIQUE Geometry;
    END Geodetic_Point;

  END Geodetic_Points.

END BaseModelMO.



GRAPHIC MODEL GraphicModelG1
  DEPENDS ON BaseModelMO
  BASED ON SYMBOLOGY =

  PRESENTATION
    Resolution;    !! User defined parameters

  TOPIC Geodetic_Points =

    VIEW Geodetic_Point_View =
      ALL OF BaseModelMO.Geodetic_Points.Geodetic_Point;
    END Geodetic_Point_View;

    GRAPHIC Geodetic_Point_Symbology
      BASED ON Geodetic_Point_View =

      MyGP1_Symbol: WHERE Type == "GP1"
        SIGNATURE SYMBOL
          Symbology: "GP500_GP1";
          Geometry: Geometry;
          Orientation: SymbolOri;
          Priority: 100;
        END;

      MyGP2_Symbol: WHERE Type == "GP2" && Resolution <= 500
        SIGNATURE SYMBOL
          Symbology: "GP500_GP2";
          Geometry: Geometry;
          Orientation: SymbolOri;
          Priority: 100;
        END;

      MyGP_Text1: WHERE Resolution <= 500
        SIGNATURE TEXT
          Symbology: "Helvetica-10-normal";
          Text: Number;
          Geometry: NumPos;
          Orientation: NumOri;
          HAlignment: NumHAli;
          VAlignment: NumVAli;
          Priority: 110;
        END;

      MyGP_Text2: WHERE Resolution > 500
        SIGNATURE SYMBOL
          Name: "GP10000_GP";
          Geometry: Geometry;
          Orientation: SymbolOri;
          Priority: 100;
        END;

    END Geodetic_Point_Symbology;

  END Geodetic_Points.

END GraphicModelG1.



The first model (BaseModelMO) defines the structure and the constraints
of the base data. A separate graphic model, called GraphicModelG1, refers
to BaseModelMO. Then the INTERLIS standard presentation model,
ILI_PRESENTATION or SYMBOLOGY, is referred to because the graphic presentation of
the objects is going to be defined. With the keywords BASED ON SYMBOLOGY
the graphic interface and the symbology library are inherited. The following
PRESENTATION clause allows parameters to be defined which can be used later on,
for example in WHERE filters. In this case the parameter Resolution is declared.
After the opening of the topic Geodetic_Points, all objects from table
Geodetic_Point are selected in a database view (Geodetic_Point_View). The
keyword GRAPHIC introduces the graphic definitions and refers to the previously
established view. Then the named graphic definitions follow. A WHERE filter can
optionally be inserted before the SIGNATURE declaration, which introduces the
definition of a graphic primitive. This is the second process step, which maps
objects to predefined symbols. For each presentation definition the graphic
interface is called with its arguments, like:

      MyGP1_Symbol:
        SIGNATURE SYMBOL
          Symbology: "GP500_GP1";
          Geometry: Geometry;
          Orientation: SymbolOri;
          Priority: 100;
        END;



The argument Priority of each symbol definition, for example, controls
the display priority (e.g. with value 100). The graphic interface of the SYMBOL
signature was inherited from SYMBOLOGY together with the standard presentation
model. The symbols - or display elements - are assigned by reference with a
unique name (e.g. MyGP1_Symbol). The management of these graphic primitives
is system and device dependent, but they are neutrally encoded in a symbology
library which is structured according to the standard graphic presentation model.
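
To make the effect of the WHERE filters concrete, the following Python sketch transcribes the four graphic definitions of the example into predicate functions and lists which of them apply to a given object and Resolution. This is a literal reading of the filters as printed, not an INTERLIS processor: the `Rule` tuple and the `applicable` function are hypothetical helpers, and how a real implementation resolves several matching definitions is not specified here.

```python
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical stand-in for one named graphic definition."""
    name: str
    signature: str          # "SYMBOL" or "TEXT"
    symbology: str
    where: Callable[[Dict[str, str], float], bool]


# Filters transcribed from the graphic model of the example.
RULES: List[Rule] = [
    Rule("MyGP1_Symbol", "SYMBOL", "GP500_GP1",
         lambda obj, res: obj["Type"] == "GP1"),
    Rule("MyGP2_Symbol", "SYMBOL", "GP500_GP2",
         lambda obj, res: obj["Type"] == "GP2" and res <= 500),
    Rule("MyGP_Text1", "TEXT", "Helvetica-10-normal",
         lambda obj, res: res <= 500),
    Rule("MyGP_Text2", "SYMBOL", "GP10000_GP",
         lambda obj, res: res > 500),
]


def applicable(obj: Dict[str, str], resolution: float) -> List[str]:
    """Names of all graphic definitions whose WHERE filter holds."""
    return [rule.name for rule in RULES if rule.where(obj, resolution)]


gp1 = {"Number": "1001", "Type": "GP1"}
gp2 = {"Number": "1002", "Type": "GP2"}
at_500 = {p["Number"]: applicable(p, 500) for p in (gp1, gp2)}
at_10000 = {p["Number"]: applicable(p, 10000) for p in (gp1, gp2)}
```

At Resolution 500 both point kinds receive their own symbol plus the text; at Resolution 10000 the text definition and the GP2 symbol no longer match.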



2.5 Discussion 

The proposed framework has been implemented using commercial GIS and draft
INTERLIS Version 2. Based on this experience, we discuss the insights
gained.



One-Way Process. The prototype implementation of the proposed approach showed
that graphic presentation can be seen as a one-way process, without any manual
interaction. The base data and the related processes are processed from left
to right in Fig. 1.



Derived Geospatial Models. By collecting the graphic presentation definitions
in a separate model, the flexibility of different views on the same spatial
object is also properly expressed structurally. This amounts to separating the
presentation model definition from the geospatial data model. Manual editing of
base data is still possible within this framework, given that any change in the
structure is modeled and stored as a separate object. The table defining
these objects can either inherit the structure of the original base table or hold a
relationship to it, so that only those attributes that have actually changed need
to be stored. The consistent implementation of this framework enables multi-scale
databases where changes in the base model are propagated to the derived models
(see Model M1 in Fig. 1).
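
One way to picture a derived model that stores only changed attributes is an override table keyed by the identifier of the base object. The Python sketch below is a simplified illustration under that assumption; the table names, the identifier `GP-1` and the `resolve` function are hypothetical and do not come from INTERLIS.

```python
from typing import Any, Dict

# Base model: the complete objects.
base_table: Dict[str, Dict[str, Any]] = {
    "GP-1": {"Number": "1001", "NumPos": (10.0, 12.5), "Type": "GP1"},
}

# Derived model: related to the base object by its key, overrides only.
derived_table: Dict[str, Dict[str, Any]] = {
    "GP-1": {"NumPos": (11.0, 13.0)},   # label position edited by hand
}


def resolve(object_id: str) -> Dict[str, Any]:
    """Merge a base object with its manual overrides, if there are any."""
    merged = dict(base_table[object_id])
    merged.update(derived_table.get(object_id, {}))
    return merged


view = resolve("GP-1")
base_table["GP-1"]["Number"] = "1001a"   # a change in the base model ...
updated = resolve("GP-1")                # ... shows up in the derived view
```

The manual edit survives the base update because it is stored separately, and the base data itself is never altered by the edit.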



Process Chains. Fig. 1 indicates process chains. These are subprocesses
consisting of procedures that can be automated, provided the required information
is available from the geospatial objects or the presentation function calls.
Examples range from calculating the offset of a text label to automated name
placement algorithms.



Generalization. Model and cartographic generalization processes are hard to
automate [19]. Those that can be handled by the computer can be integrated
as subprocesses as mentioned above. For generalization problems that
require manual editing, derived geospatial models are proposed.



2.6 Compatibility with Other Standards 

There are not many standards in this domain. The international standardization
body, ISO [13], rather provides a general framework for interoperability with which
implementation standards have to comply.

Currently, the Open GIS Consortium [4] is an emerging standardization
initiative from industry which has a rather process-centered technical approach. It
tries to standardize, e.g., simple feature (or object) types and feature catalogue
services. With such a service the properties of an object can be accessed through an
API call. This is very similar to a specific schema. While this is useful for
interactively exploring geospatial data sets, it is probably not adequate for getting an
overview of the overall structure and the relationships. Certainly it is not meant
as a standard tool to model the application schema of a user or information
community. As mentioned before, a working group for the “Interactive Portrayal”
of geospatial data in the World Wide Web has recently been constituted.

The approach presented here could complement these efforts in many respects:
INTERLIS delivers a compact, unifying geospatial data definition language
including a presentation part and a file format. It therefore fits well with
the international standards defined so far and will continue to enhance the means for
information communities to document their needs and share their geospatial
data.

3 Conclusions and Future Work 

The framework presented here has been tested in a prototype and the benefits
can already be identified: graphic presentation definitions and even symbology
libraries can be shared and distributed in a federated systems environment.
The vision is to enable distributed providers who specialize in graphic
presentation services.

Taking the diversity of present GIS into account, it is almost impossible to
agree on a common set of attributes for defining graphic symbols, so we had
to strike a balance between completeness and implementation costs. Therefore
further work is required for assessing the proposed standard presentation model.
Although this framework is based on 2D, it can be extended to 3D in future
versions by introducing solids. But due to unresolved geoprocessing and geospatial
modeling issues this remains to be done.

From a process-centered view, it is very important to identify appropriate
process chains that take the properties of the chosen output device into account.
Equally important is the investigation of the human part in the process in order
to allocate interactive processes like continuous feedback while editing.

Additional experiments need to be made in order to explore the full range of
cartographic problems in GIS, especially the interoperability of graphic
presentation of geospatial data.



Abstract. The Open GIS Consortium has defined and classified two-dimensional
geometries for the SQL92 Data Model Architecture in order
to improve interoperability of spatial information systems. This specification
was used as a reference to evaluate the implemented data models
of two commercial GIS packages. It is concluded that the data models of
both commercial packages are not OpenGIS compliant. They generally
follow the rules as set by the Open GIS Consortium, but there are still
deviations. These deviations are related to the concepts of the package
developers and the way geometric entities are handled in their software.

The fact that deviations exist should encourage a more in-depth study of
the OpenGIS Data Model Architecture in order to capture all occurrences
of geometric phenomena. A first contribution was made by examining
geometries in the Dutch Topographic Map, scale 1:10,000.

1 The Challenge 

In 1994 the Open GIS Consortium (OGC) was launched with the mission of
improving the interoperability of GIS. At the heart of this mission was a need
to find a common standard for the structuring of topological entities, including
their structuring in an SQL environment. This common standard has evolved
into the OGC “Simple Features Specification For SQL” (NN, 1998). However, from
a practical point of view this specification is only of value once it is embodied
within systems that are used in practice. In 1997 two major commercial packages
came on the market: Environmental Systems Research Institute Spatial Data
Engine (SDE) 3.0.2 for ORACLE version 7 and ORACLE Corporation Spatial
Data Option (SDO) 7.0 for ORACLE version 7.3.3 (and later, with the same
functionality, Spatial Data Cartridge (SDC) 8.0 for ORACLE 8.0). Both were
intended to use the OGC standards as the basic building block for structuring
vector databases in Relational Database Management Systems (RDBMS).

It was a challenge to evaluate how well SDO and SDE apply the OGC
specification and furthermore to establish whether or not the practical application of
these tools and standards meets the needs of users for the topological
structuring of geographic data. To do this, a case study data set from the Dienst
Landelijk Gebied in the Netherlands was implemented in both SDO and SDE
(Cattenstart, 1998).

The concepts and implementations of spatial data in relational databases 
cover many aspects like storage structure, data format, indexing and search al- 
gorithms. This article concentrates on the OGC spatial Data Model Architecture 
for two dimensional SQL92 oriented environments and its practical impact. 



2 The OGC Vector Data Representation 

“OpenGIS” can be defined as “the interoperability of geospatial data and
geoprocessing resources” (NN, 1996). In practice it means that data resources and
the calling of process functionality can be shared between different (parts of)
organisations and software packages that process spatial data. To the user, data
storage should be transparent between systems. This wish calls for a uniform,
all-inclusive data model in order to communicate geometric data correctly.

A spatial object consists of one or more geometries which consist of data
types. The following geodata types for two dimensional space are specified in
the Open Geodata Model:

1. Point - a zero dimensional topology. This type specifies a geometric location.

2. Curve - a one dimensional topology. This type specifies a family of geometric
entities including line segments, line strings, arcs, b-spline curves, and so on.

3. Surface - a two dimensional topology. This type specifies a family of
geometric entities including areas and surfaces that are defined in other ways.

In the Open Geodata Model it is possible to combine elements of the same
data type into a geometry collection. Fig. 1 shows the building blocks for the
geometry specification and their hierarchy. In the vector representation there
is another kind of element that is of great importance. It can be named Node
and has the spatial characteristic that it is an intermediate location on a curve
or surface boundary. A Node can have attribute data. A Node might for example
be a dam in a river, a crossing of roads, or the junction of three state borders.
The most significant difference from Point is that Node is directly coupled to
geodata type Curve or Surface boundary and that it is not necessarily located
on an explicit co-ordinate. Fig. 1 could be extended as shown in Fig. 2.

It is recommended that OGC clarify the notion of Node.

The SDO data model also identifies the geometric entities Point, Curve and
Surface as well as Multi element geometries (NNa, 1997). SDO has no explicit
implementation of the entity Node. However, the user can define a co-ordinate
to act as a Node.

The SDE data model identifies the geometric entities Nil, Point, Curve and
Surface and also Multi element geometries (NNb, 1997). The Nil entity is
introduced as an object without contents as the physical outcome of a spatial
function with no geometric interaction. SDE has no explicit implementation of
the entity Node. SDE uses an attribute called “Measure” in order to attach
attribute values to a specific co-ordinate.

Fig. 1. Entity building block hierarchy

Fig. 2. Proposed extension of the entity building block hierarchy with Node



3 The SQL92 Implementation Model Data Architecture 

The OGC Model Storage Architecture has an object-based approach. This means
that every geometry is self-contained and not dependent on any other entity.
With regard to administrative systems this is an advantage, since the geometric
entity can be fully incorporated within the rules of set theory. For example, a
parcel may have several contracts, in which case the geometry might occur in the
database as many times as there are contracts. Discarding a contract will invoke
discarding the associated geometry. This will not affect the other contracts since
they have their own geometry specification of the same parcel.

On the other hand, the geometric data structure lacks any kind of normalisation.
Co-ordinate duplication occurs in any instance where geometries interact. For
example, when roads are represented as lines and a junction is considered, the
co-ordinate of the junction is stored as many times as there are lines ending or
starting at the junction. When polygon entities are considered, two polygons that
are neighbours have a common boundary. The co-ordinates of this boundary are
stored separately for each polygon and are thereby present in the database in
duplicate.

OGC embraces current developments in the GIS world as they are. However, with this
trend OGC moves away from an essential property of spatial data: intrinsic
topological relationships. Shared co-ordinates only have to be stored once and, when
correctly referenced, reveal spatial relations by simple searching in tables instead
of computationally intensive relationship calculations (Cattenstart, 1998). Since this
is not the case, determining topological relationships is in the hands of software
developers who have their own concepts and implementations in this respect. It
is recommended that OGC investigate this perception.
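
The difference between the two storage approaches can be illustrated with a three-road junction. The Python sketch below is an illustration only and uses neither SDO nor SDE; the road names, the co-ordinates and the dictionaries are assumptions, and co-ordinates are compared by exact equality, which real data would rarely allow without a tolerance.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Coord = Tuple[float, float]

# Object-based storage: every road carries its own full co-ordinate list.
roads: Dict[str, List[Coord]] = {
    "A": [(0.0, 0.0), (5.0, 5.0)],
    "B": [(10.0, 0.0), (5.0, 5.0)],
    "C": [(5.0, 10.0), (5.0, 5.0)],
}
stored_object_based = sum(len(coords) for coords in roads.values())

# Topological storage: each distinct co-ordinate is stored once as a node,
# and every line refers to its nodes by identifier.
node_ids: Dict[Coord, int] = {}
lines: Dict[str, List[int]] = {}
for name, coords in roads.items():
    lines[name] = [node_ids.setdefault(c, len(node_ids)) for c in coords]
stored_shared = len(node_ids)

# Spatial relations by table lookup instead of geometric calculation.
lines_at_node: Dict[int, List[str]] = defaultdict(list)
for name, ids in lines.items():
    for node in ids:
        lines_at_node[node].append(name)

junction = node_ids[(5.0, 5.0)]
roads_meeting = lines_at_node[junction]
```

In the object-based form the junction co-ordinate is stored three times; in the shared form it is stored once, and the roads meeting there are found by a lookup.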

4 The OGC, SDO, and SDE Data Models Compared 

In this section the OGC data model is used as a reference to evaluate the SDO
and SDE data models. For each geometric data type the OGC assertions and, if
applicable, special occurrences are presented. For each assertion and occurrence
presented it is noted whether it is supported by the SDO and SDE data models. In
some cases support is unknown due to imperfections of the checking routines in
the packages.

In the OGC concept geometric elements are categorised into simple and non
simple geometries. OGC has defined extensive (set theory based) rules to
distinguish simple from non simple geometries (NN, 1998). This distinction is made in
order to correctly communicate geometries and their properties between systems.

4.1 The Point Data Type

The OGC definition of a Point is that it is a zero dimensional geometry and
represents a single location in co-ordinate space.

OGC has not determined special occurrences of Point.

Table 1. Assertions for Point data type

OGC Assertion                                            SDO    SDE
Has an ordinate value for each dimension                 Yes    Yes

4.2 The Multipoint Data Type

The OGC definition of a Multipoint is that it is a zero dimensional geometric
collection. A Multipoint is called simple in the OGC data model if no two Points
in the Multipoint are equal (have identical co-ordinate values). Multipoint as
defined by OGC is supported by both SDO and SDE. SDO and SDE support
simple and non simple Multipoint.

Table 2. Assertions for Multipoint data type

OGC Assertion                                            SDO    SDE
The elements of a Multipoint are restricted to Points    Yes    Yes
The Points are not connected or ordered                  Yes    Yes
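
The simplicity rule for a Multipoint stated in Sect. 4.2 translates directly into a duplicate check. The Python function below is a minimal sketch; its name is hypothetical and it compares co-ordinates by exact equality, which is an assumption (the packages may apply a tolerance).

```python
from typing import Iterable, Tuple

Coord = Tuple[float, float]


def is_simple_multipoint(points: Iterable[Coord]) -> bool:
    """True if no two points have identical co-ordinate values."""
    pts = list(points)
    # A set drops duplicates, so equal sizes mean all points are distinct.
    return len(set(pts)) == len(pts)
```

For example, `[(0, 0), (1, 1)]` is simple, while `[(0, 0), (1, 1), (0, 0)]` is not.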



4.3 The Curve Data Type 

A curve is a one-dimensional geometric object stored as a sequence of points,
with the subclass of curve specifying the form of the interpolation between points.
The OGC specification defines only one subclass of curve, LineString, which uses
linear interpolation between points. This is also the case in SDO and SDE. The
set of OGC assertions covers the assertions in SDO and SDE.

Table 3. Assertions for Curve data type (subclass LineString)

OGC Assertion                                                       SDO    SDE
Each Curve has a start point and an end point                       Yes    Yes
The boundary of a Curve consists of its start point and end points  Yes    Yes
The sequence of points determines the global morphology             Yes    Yes
A Curve is defined as topologically closed (= no gaps)              Yes    Yes
Each consecutive pair of points defines a Line segment              Yes    Yes
Only one algorithm determining the form of all Line segments        Yes*   Yes

* Extended in ORACLE 8i-Spatial. In 8i-Spatial a curve may consist of sections.
Each section can have a different algorithm that determines its shape.

A Curve is called simple by OGC if it does not pass through the same point
twice. In the SDO data model there is no distinction between simple and non
simple curves. A curve may cross itself. However, a self-crossing LineString does
not have an implied interior. In SDE the user can determine if curves may self
cross.

The distinction between simple and non simple curves has great impact on
geometry classification. The following special occurrences are considered:



Table 4. Occurrences of Curve data type (subclass LineString)

Occurrence                         OGC          SDO    SDE
Start point equals end point       Closed       Yes    Yes
Simple closed                      LinearRing   Yes    Yes
Non simple closed                  Non simple   Yes    Yes
Non linear interpolation           Undefined    No*    No
Zero length Line                   Simple ?     Yes    No
Zero length Line segment           Simple ?     Yes    No
Self crossing                      Non simple   Yes    No
Non sequential segments overlay    Non simple   Yes    No
Non sequential points overlay      Non simple   Yes    No

* Supported in ORACLE 8i-Spatial. In 8i-Spatial a curve may consist of sections.
Each section can have a different algorithm that determines its shape.
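
The classification of LineStrings into closed, simple and non simple can be sketched with a standard orientation-based segment intersection test. The Python code below is an illustration, not the checking routine of OGC, SDO or SDE; all function names are hypothetical. One assumption is made explicit: zero length segments are treated here as non simple, although the OGC status of that case is left open above.

```python
from typing import Sequence, Tuple

Coord = Tuple[float, float]


def _orient(a: Coord, b: Coord, c: Coord) -> int:
    """Sign of the cross product (b-a) x (c-a); 0 means collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _in_box(a: Coord, b: Coord, p: Coord) -> bool:
    """True if p lies within the bounding box of segment a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def _intersect(p1: Coord, p2: Coord, q1: Coord, q2: Coord) -> bool:
    """True if segments p1-p2 and q1-q2 share at least one point."""
    o1, o2 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    o3, o4 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _in_box(p1, p2, q1)) or (o2 == 0 and _in_box(p1, p2, q2))
            or (o3 == 0 and _in_box(q1, q2, p1)) or (o4 == 0 and _in_box(q1, q2, p2)))


def _folds_back(p: Coord, q: Coord, r: Coord) -> bool:
    """Segments p-q and q-r share q; True if q-r runs back over p-q."""
    dot = (p[0] - q[0]) * (r[0] - q[0]) + (p[1] - q[1]) * (r[1] - q[1])
    return _orient(p, q, r) == 0 and dot > 0


def is_closed(line: Sequence[Coord]) -> bool:
    """Start point equals end point."""
    return len(line) >= 2 and line[0] == line[-1]


def is_simple(line: Sequence[Coord]) -> bool:
    """True if the LineString does not pass through the same point twice."""
    segs = list(zip(line, line[1:]))
    if not segs or any(a == b for a, b in segs):
        return False    # assumption: zero length segments count as non simple
    n = len(segs)
    for i in range(n):
        for j in range(i + 1, n):
            (a, b), (c, d) = segs[i], segs[j]
            if j == i + 1:                      # neighbours share point b == c
                if _folds_back(a, b, d):
                    return False
            elif i == 0 and j == n - 1 and is_closed(line):
                if _folds_back(c, d, b):        # ring closure shares d == a
                    return False
            elif _intersect(a, b, c, d):
                return False
    return True
```

A LineString for which both `is_closed` and `is_simple` hold corresponds to the LinearRing row of Table 4.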



4.4 The MultiCurve Data Type 

A MultiCurve is a one-dimensional Geometry Collection whose elements are Curves.
MultiCurve is a non-instantiable class in the OGC specification; it defines
a set of methods for its subclasses and is included for reasons of extensibility.
SDO and SDE support subclass MultiLineString. A MultiCurve is simple if and
only if all of its elements are simple and the only intersections between any two
elements occur at points that are on the boundaries of both elements.

Table 5. Assertions for MultiCurve data type (subclass LineString)

OGC Assertion                                  SDO    SDE
All elements of the MultiCurve are Curves      Yes    Yes
A MultiCurve is topologically closed           Yes    Yes

4.5 The Surface Data Type

In planar two dimensional space a Surface is a planar two dimensional geometric
object.

A Surface is called simple by OGC when it consists of a single 'patch' that
is associated with one 'exterior boundary' and zero or more 'interior' boundaries
that do not cross.

Polyhedral surfaces are formed by 'stitching' together simple surfaces along
their boundaries. SDO and SDE do not support polyhedral surfaces.

The only instantiable subclass of Surface defined in the OGC specification
is the Polygon. A Polygon is a planar surface, defined by one exterior boundary
and zero or more interior boundaries. Each interior boundary defines a hole in
the polygon.



Table 6. Assertions for Surface data type (subclass Polygon)

OGC Assertion                                         SDO  SDE
----------------------------------------------------  ---  ---
The boundary of a polygon consists of LinearRings     Yes  Yes
A polygon may have zero or more interior boundaries   Yes  Yes
Polygons are topologically closed                     Yes  Yes
No two Rings in the boundary cross                    Yes  Yes
Exterior and interior boundaries are not connected    ?    Yes
Boundaries of polygons do not cross                   Yes  Yes



Due to these assertions Polygons are considered simple.

In SDO an additional assertion is made:

1. A Polygon must be determined by at least three co-ordinates.

In SDE the following additional assertions are made:

1. Co-ordinates of the exterior boundary must be in counter clockwise sequence,

2. Co-ordinates of the interior boundaries must be in clockwise sequence,

3. Multiple interior boundaries that touch are combined into one interior
boundary,

4. An interior boundary that touches the outer boundary is converted into an
inversion of the outer boundary.
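The two SDE assertions on co-ordinate sequence can be checked with the signed area of a ring (the shoelace formula). The sketch below is an illustration under assumptions: the function names are hypothetical, rings are lists of (x, y) tuples with or without a repeated closing point, and nothing here is taken from the SDE implementation itself.

```python
def signed_area(ring):
    """Shoelace formula: positive for a counter clockwise ring, negative for clockwise."""
    total = 0.0
    # Pair every vertex with its successor; the last vertex wraps around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_counter_clockwise(ring):
    return signed_area(ring) > 0


def orient_polygon(exterior, interiors):
    """Return the rings with the exterior counter clockwise and the interiors clockwise."""
    oriented_exterior = exterior if is_counter_clockwise(exterior) else exterior[::-1]
    oriented_interiors = [ring[::-1] if is_counter_clockwise(ring) else ring
                          for ring in interiors]
    return oriented_exterior, oriented_interiors
```

A ring with zero signed area is neither clockwise nor counter clockwise; this corresponds to the "zero area" occurrence discussed for Polygons and is left untouched by the sketch apart from a possible reversal.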

The fact that polygons are composed of LinearRings has an impact on the
geometry classification. The following special occurrences are considered:

Table 7. Occurrences of Surface data type (subclass Polygon)

Occurrence                     OGC         SDO  SDE
-----------------------------  ----------  ---  ---
Zero length boundary segment   Simple ?    Yes  No
Zero perimeter                 Simple ?    ?    No
Zero area                      Simple ?    ?    No
Boundary cross                 Non simple  No   No
Boundary segment overlap       Non simple  No   No
Boundary segments touch        Non simple  Yes  No
Touching islands               Simple      Yes  No
Islands overlap                Non simple  No   No



Table 8. Assertions for MultiSurface data type (subclass Polygon)

OGC Assertion                                                         SDO  SDE
--------------------------------------------------------------------  ---  ---
The boundary of a MultiPolygon is a set of closed curves              Yes  Yes
corresponding to the boundaries of its element Polygons

A MultiPolygon is defined as topologically closed                     Yes  Yes

Each curve in the boundary of the MultiPolygon is in the boundary     Yes  Yes
of exactly one element Polygon, and every curve in the boundary of
an element Polygon is in the boundary of the MultiPolygon

The interiors of two Polygons that are elements of a MultiPolygon     Yes  Yes
do not intersect

The Boundaries of any two Polygons that are elements of a             Yes  Yes
MultiPolygon may not 'cross' and may touch at only a finite
number of points



4.6 The MultiSurface Data Type 

A MultiSurface is a geometry whose elements are Surfaces. They are made up
from Polygons. A MultiPolygon is a MultiSurface whose elements are Polygons.

The assertions for MultiPolygon prevent topological overlap of polygons that
are elements of the MultiPolygon. The SDO and SDE data models are identical
in that respect.

5 Discussion

The user is not interested in the way data are organised in the database. She/he
only wants to store and retrieve. What is important to the user is that the system
is capable of accommodating the user's view of the world. In that respect it is
important that the system is capable of accepting the geometries as they are fed
into the system. Unfortunately this is not always the case.

From the previous section it can be derived that the basic differences between
the data models stem from the acceptance of sequential co-ordinates at the same
location and the acceptance of Line and boundary segments that cross or overlap
within the geometry. Additionally, SDE modifies interior boundaries of a Polygon
when they interact. Within the OGC simple feature class no overlap or crossing is
allowed. As will be shown in the next section, this places several types of
occurrences in the non simple feature class, where exchange of data between
systems is difficult.

Regarding Point geometries there are no differences between OGC, SDO and 
SDE. 

Regarding Line type geometries the most important differences are: 

1. Zero length Line, 

2. Zero length Line segments, 

3. Self crossing, 

4. Line segment overlap. 

In the case of zero length Lines and zero length Line segments it should be
noted that these deviations from a simple geometry do not change the
morphology of the geometry. From a data model or storage volume perspective
it may seem unnecessary to have two or more co-ordinates at the same
location, but the system is not capable of determining why the co-ordinates
are there. For example, the co-ordinates or zero length Line segments might be
there as placeholders or foreign keys.
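The consequence of discarding such co-ordinates can be made concrete with a small sketch. It assumes, purely for illustration, that each co-ordinate carries one attribute value (for example a foreign key); the function name and the attribute values are hypothetical and do not describe the behaviour of a specific package.

```python
def drop_zero_length_segments(coords, attributes):
    """Remove co-ordinates that repeat their predecessor and report what is lost.

    Returns the kept co-ordinates, the kept attributes, and a list of
    (original index, attribute) pairs for every discarded co-ordinate.
    """
    kept_coords, kept_attributes, dropped = [], [], []
    for index, (point, attribute) in enumerate(zip(coords, attributes)):
        if kept_coords and kept_coords[-1] == point:
            # Same location as the previous kept co-ordinate: the shape does
            # not change, but the attached attribute disappears with it.
            dropped.append((index, attribute))
        else:
            kept_coords.append(point)
            kept_attributes.append(attribute)
    return kept_coords, kept_attributes, dropped
```

The morphology of the returned line is identical to that of the input, yet every entry in the third return value is information that the user can no longer retrieve.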

Self crossing and segment overlap do not change the properties of the
geometry in itself. Self crossing becomes a problem when determining what to do
with it (whether it is a true intersection or not). There are no uniform rules on
how to deal with it in a certain context. To avoid such problems, self crossing
and segment overlap are defined as not allowed.

Polygon boundaries are constructed using Lines and LineStrings. In all
models the restrictions on these Lines and LineStrings also apply to Polygon.
Specifically related to Polygon the most important differences are:

1. Zero area,

2. Zero perimeter,

3. Crossing and overlap of the internal and external boundaries,

4. Interactions between internal and external boundary.

“Zero area” and “zero perimeter” are related to the acceptance of co-ordinates
at the same location and were discussed above for linear types.
“Crossing” and “overlap” are also related to the validation rules for Lines, but
for Polygon there is the fundamental issue of validating boundary segments that do
not determine the polygon characteristics. An example is the common boundary
segment of two internal boundaries. The common boundary has no relationship
with the Polygon morphology or characteristics like perimeter and area. The
question is whether this segment is an essential part of the geometry. In the next
section some examples are given.

OGC has made an effort to define simple geometries as exactly as possible. All
other instances of geometries are considered non simple. Rules on converting non
simple to simple geometries are absent. This leaves room for software developers
to implement their own package specific rules on converting non simple to simple
geometries. Developers are free to implement their definition of “simple and
non-simple” as a proprietary superset of what OGC has defined.

6 Effects of Data Storage of Geometries 

In order to determine the morphology and characteristics of geometries as they
emerge in everyday practice, a representative geodata set of large volume was
examined. The data set chosen, which is used intensively by the Dutch GIS
community, is the Dutch National Topographic Map (TOP10vector), scale
1:10,000. The data set contains over 15 million polygons, 75 million separate lines
and 1.5 million points (Cattenstart, 1992).

A program filter in conjunction with visual inspection was used to find specific
occurrences of geometry morphologies. A set of morphologies was derived from
the dataset and some are shown in Table 9. Based on the method of “strong
inference” (Platt, 1964) these morphologies were inserted into SDO and SDE
and tested for validity.

In Table 9 the morphologies of geometries that are tested are presented. In
the SDO and SDE columns the package reaction is denoted as:

1. Accepted: The morphology is accepted as presented,

2. Modified: The morphology receives a co-ordinate modification. This can be
discarding, addition or movement,

3. Converted: The morphology is split, joined or reclassified into another
(set of) geometry type(s),

4. Rejected: The morphology was not accepted,

5. Unknown: Due to imperfection of the checking routines the acceptance is
unknown.

It is remarked here that in the case of modification not only the morphology's
shape is changed but related information might also be lost. In SDE this might
be a “Measure”, in SDO a foreign key to other data which is coupled to the
co-ordinate column. In the validation columns of Table 9 a code is provided
referring to the following set of remarks:

0 = Zero length segment,

1 = LineString or polygon boundary is self-intersecting,

2 = Polygon does not close properly,

3 = The number of points is less than required for geometry,

4 = Polygon patch has no area,

5 = Co-ordinate discarded or added without loss of morphology.
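The package reactions and remark codes above form a small vocabulary that can be represented explicitly when test results are recorded. The sketch below is one possible encoding; the names `Reaction`, `REMARKS` and `parse_cell` are hypothetical, and the cell format ("Rejected code: 1") is assumed from the way the validation columns are written.

```python
import re
from enum import Enum


class Reaction(Enum):
    ACCEPTED = "Accepted"
    MODIFIED = "Modified"
    CONVERTED = "Converted"
    REJECTED = "Rejected"
    UNKNOWN = "Unknown"


REMARKS = {
    0: "Zero length segment",
    1: "LineString or polygon boundary is self-intersecting",
    2: "Polygon does not close properly",
    3: "The number of points is less than required for geometry",
    4: "Polygon patch has no area",
    5: "Co-ordinate discarded or added without loss of morphology",
}


def parse_cell(cell):
    """Split a validation cell into its reaction(s) and the text of its remark code."""
    reactions = [r for r in Reaction if r.value.lower() in cell.lower()]
    match = re.search(r"code:\s*(\d+)", cell)
    remark = REMARKS.get(int(match.group(1))) if match else None
    return reactions, remark
```

A cell may carry more than one reaction, so the function returns a list; reactions are reported in the order in which they are declared in `Reaction`.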



From Table 9 it is clear that there is a variety of occurrences of geometries.
Related to the OGC data model both simple and non simple geometries were
found. This leaves room for interpretation and geometric modification of inserted
entities. The table shows evidence of this manipulation.

Occurrence 1 shows overlap of Points and an identical validation in SDO and
SDE.

Occurrences 2 through 5 show variations of crossing, overlap and co-ordinates
at the same location of lines. They show how SDO and SDE deal with these
geometries differently.

Occurrence 6 is a more fundamental problem. Line segments overlap here and
therefore the line is classified non simple. However, although crossing and overlap
are not allowed in SDE the line is accepted as is.

Occurrence 7 shows how topological overlap is approached identically in the
OGC, SDE and SDO data models.

Occurrences 8 through 13 show variations of crossing and overlap of polygons
and are all validated identically by SDO and SDE in accordance with the OGC
data model.

Occurrence 14 shows SDE merging both islands into one due to its data
model rules.

Occurrence 15 shows a fundamental difference between SDO and SDE. In SDO
the common internal boundary is preserved, in SDE it is discarded. The question is
whether the common boundary is an essential part of the geometry. The OGC data
model does not provide rules in this case.

Occurrences 16 and 17 show that although the data models are equal there are
different interpretations leading to different results.

Occurrence 18 shows that in all data models dangling Line segments in
polygon boundaries are not accepted.

Occurrence 19 shows an anomaly that is treated differently by SDO and SDE.

Table 9 is evidence of the fact that the data models of OGC, SDE and SDO
differ from each other with the result that the same dataset appears differently in
SDO and SDE. To the user it is important to understand geometry modifications
and decomposition necessities since a phenomenon or entity in the user's view is
recomposed. On improper retrieval the user may think that the stored entities
are corrupted, or even worse, that they no longer exist in the database.

7 Conclusions and Future Developments 

The comparison between the OGC Data Model Architecture and the SDO and
SDE reveals differences. The SDO and SDE data models are therefore not fully
OpenGIS simple feature compliant.

Table 9. Special geometry morphologies and their validation in SDO and SDE.
(The Morphology column of the original table contains sketches, which are not reproduced here.)

No.  Geometry type  Origin                                                      OGC type       SDO valid          SDE valid
---  -------------  ----------------------------------------------------------  -------------  -----------------  ----------------------------
1    Point          2 geometries at the same location. Overlapping geometries.  Multi point    Accepted           Accepted
2    Line           Zero length line. Digitising error? Resolution error?       Single         Accepted           Rejected, code: 0
3    Line           Closed line.                                                Single         Accepted           Accepted
4    Line           Crossing line with interior.                                Non single     Accepted           Accepted
5    Line           Zero length segment. Digitising error?                      Single         Accepted           Modified, code: 0
6    Line           Reverse direction. Resolution error?                        Non single     Accepted           Accepted
7    Polygon        Topological overlap.                                        Multi polygon  Rejected, code: 1  Rejected, code: 1
8    Polygon        Polygon with zero length segment.                           Single         Accepted           Modified, code: 0
9    Polygon        Overlapping segments.                                       Non single     Rejected, code: 1  Rejected, code: 2
10   Polygon        Boundary inward loop with no co-ordinate at the crossing.   Non single     Rejected, code: 1  Rejected, code: 1
11   Polygon        Touching inside island.                                     Non single     Rejected, code: 1  Rejected, code: 1
12   Polygon        Boundary outward loop with no co-ordinate at the crossing.  Non single     Rejected, code: 1  Rejected, code: 1
13   Polygon        Inward overlapping segment.                                 Non single     Rejected, code: 1  Rejected, code: 1
14   Polygon        Touching islands.                                           Single         Accepted           Converted, Modified, code: 5
15   Polygon        Islands sharing a segment boundary.                         Single         Accepted           Converted, Modified, code: 5
16   Polygon        Islands overlapping.                                        Non single     Rejected, code: 1  Rejected, code: 1
17   Polygon        Crossing outer boundary with co-ordinates at the crossing.  Multi polygon  Accepted           Modified, code: 5
18   Polygon        Outer boundary with overlapping segments.                   Non single     Rejected, code: 1  Rejected, code: 2
19   Polygon        Outer boundary segments split polygon.                      Multi polygon  Rejected, code: 1  Rejected, code: 2
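The SDO and SDE columns of Table 9 can be compared mechanically. The sketch below is only an illustration: the reactions are transcribed by hand from the table (remark codes are omitted and the combined "Converted, Modified" reaction is written as one string), and the names `TABLE_9`, `differing_occurrences` and `reaction_counts` are hypothetical.

```python
from collections import Counter

# (occurrence, geometry type, SDO reaction, SDE reaction); remark codes omitted.
TABLE_9 = [
    (1, "Point", "Accepted", "Accepted"),
    (2, "Line", "Accepted", "Rejected"),
    (3, "Line", "Accepted", "Accepted"),
    (4, "Line", "Accepted", "Accepted"),
    (5, "Line", "Accepted", "Modified"),
    (6, "Line", "Accepted", "Accepted"),
    (7, "Polygon", "Rejected", "Rejected"),
    (8, "Polygon", "Accepted", "Modified"),
    (9, "Polygon", "Rejected", "Rejected"),
    (10, "Polygon", "Rejected", "Rejected"),
    (11, "Polygon", "Rejected", "Rejected"),
    (12, "Polygon", "Rejected", "Rejected"),
    (13, "Polygon", "Rejected", "Rejected"),
    (14, "Polygon", "Accepted", "Converted/Modified"),
    (15, "Polygon", "Accepted", "Converted/Modified"),
    (16, "Polygon", "Rejected", "Rejected"),
    (17, "Polygon", "Accepted", "Modified"),
    (18, "Polygon", "Rejected", "Rejected"),
    (19, "Polygon", "Rejected", "Rejected"),
]


def differing_occurrences(rows):
    """Occurrence numbers for which the two packages react differently."""
    return [number for number, _, sdo, sde in rows if sdo != sde]


def reaction_counts(rows, column):
    """Tally the reactions in one column (2 = SDO, 3 = SDE)."""
    return Counter(row[column] for row in rows)
```

Because the remark codes are left out, two rejections with different codes count as the same reaction here; a stricter comparison would include the code in the tuple.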



SDO and SDE differ in their set of geometry validation rules and the way
they interpret the OGC specification. Both systems will react differently when
the same data are inserted. This will lead to problems when data from these
packages are integrated. To the user this is a most unwanted situation and is a
rationale for the need of OpenGIS.

OGC has classified geometries into simple and non simple features. Rules
for converting non simple to simple geometries are absent. This leaves room
for software developers to implement their own concepts in dealing with certain
types of geometries. This situation calls for a further extension of geometry
definitions and rules to convert from non simple to simple entities in the OpenGIS
Data Model Architecture.

OGC has distinguished data types Point, Curve and Surface. There is the
notion of Node, defined as an intermediate location on a curve or surface
boundary. A Node can have attributes. The OpenGIS Data Model Architecture
is object based. Geometries are stored as independent entities. Normalised
storage of geometries is thereby absent, as well as intrinsic topological relationship.
It is recommended that OGC perform research on the notion of Node and intrinsic
topological relationship.

From a user's point of view the positive message is that with proper care, spatial
data can be incorporated, managed and made available in both packages, following
the rules of OpenGIS. Still there is much work to be done before the OGC
specification and its implementation in products like SDO and SDE will provide
GIS products which are truly interoperable in character.

Future development of OpenGIS might concentrate on more in-depth
definitions of simple and non simple geometries, rules for conversion from
non simple to simple geometries, research and rules on extension of the data
model with entity types like Node or user compiled combinations, and standards
for determining topological relationships.

The link between the planar two dimensional data model and the three
dimensional data model will become stronger and non linear interpolation between
co-ordinates in planar 2D might become of more importance in this respect.

In all cases the user should be helped in her/his task to operate on data
where the physical storage and representation schema of these data is totally
transparent to her/him. However, the bumpy road to this goal leads through
practical, scientific, technical and commercial interests.



Abstract. A prototype of a Geographic Database Integrator is under
investigation and development. One of the long term goals of the Geographic
Database Integrator is to reduce the need for operator intervention
in update operations between objects in different databases. This paper
focuses on the research related to road network elements from two
independently surveyed and maintained topographic databases, one at large
scale and one at mid-scale. Central to the issue of update propagation is
certainty of equivalence of different road object representations.
Therefore, precise definitions of road segment and road junction are important.
In both the large and mid-scale geographic data sets, the roads are area
features, although the whole may be considered as one linear road
network. In order to find the junctions and road segments, the constrained
Delaunay triangulation is applied. Using these well defined elements, a
strategy for finding equivalent or corresponding road network elements
has been developed. This forms the basis for processing and propagating
updates in the road network from one database to the other database.



1 Introduction: Scope, Context, and Related Work 

Geographic Database Integration is the process of establishing links between
corresponding objects in different, heterogeneous and autonomously produced
databases of a certain region [15]. The purpose of geographic database integration,
in general, is to share geo-information between different sources. Sharing
geo-information is a communication process. In communication the semantics or
meaning of data is important and touches at the heart of interoperability. In
this paper geographic database integration is being studied in the context of
update propagation, that is the reuse of updates from one geographic database
to another geographic database [20]. Geographic database integration gets more
and more attention nowadays since the digitizing of traditional map series has
ended. In these map series corresponding objects were only linked implicitly by
a common reference system, the national grid [21]. In order to make these links
more explicit, geo-science researchers and computer scientists have developed
various strategies. In computer science schema integration has been the
dominant methodology for database integration; see for example [11]. This approach
has been extended for geographic databases; see [6] for an overview and see [2]
for a fine example. Geo-scientists on the other hand have adopted methods from
communication theory like relational matching [21] and, in our case, from the
field of AI [14]. In our approach the construction and use of an ontology for
geographic databases [9,18] makes it possible to inspect the result of the geographic
database integration process for inconsistencies [13].

But despite these advances in geographic database integration methodologies,
there is still a problem that cannot be solved by these methodologies alone, that
is the demarcation of homologous entities, suitable for update propagation
especially in the case of road networks; see Section 2. The remainder of this paper is
organized as follows. Section 3 introduces the semantics of road networks,
specifically the definitions of road segments and road junctions. In Section 4 it is
demonstrated how these road segments and road junctions are demarcated by
a constrained Delaunay triangulation algorithm. The relationship with the
road center lines (skeleton) is also discussed in this section. A six step update
propagation method for road elements is given in Section 5. Note that a road element
is a road segment or a road junction. Finally, descriptions of the conclusions and
future work end this paper.




Fig. 1. The mid-scale topographic database TOP10vector.



2 The Demarcation of Homologous Entities 

Fig. 2. The large-scale topographic database GBKN.

In previous research the propagation of updates in building objects (e.g. houses,
building blocks, garages, annexes, etc.) has been studied [20]. Building objects
share the property that they are demarcated quite naturally. In contrast, road
elements are sometimes demarcated in a haphazard way; see Fig. 1 and Fig. 2.
Here road elements from the mid-scale topographic database correspond to
several different road elements from the large-scale topographic database. This
n:m correspondence relationship is not suitable for pin-pointing an update from
one database to another. So it is necessary to demarcate entities in such a way
that 1:1 (or 1:n or n:1) correspondence relationships can be established; that
means the demarcation of homologous entities.
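Whether a set of correspondence links is suitable for update propagation can be checked by grouping the linked objects and labelling each group with its cardinality. The sketch below assumes that links are given as pairs of object identifiers (one from each database); the function names and the identifiers used in the usage note are hypothetical.

```python
from collections import defaultdict


def cardinality(n_a, n_b):
    """Label a group by the number of objects on each side."""
    if n_a == 1 and n_b == 1:
        return "1:1"
    if n_a == 1:
        return "1:n"
    if n_b == 1:
        return "n:1"
    return "n:m"


def correspondence_groups(links):
    """Group linked objects into connected sets and label each set."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for a, b in links:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    seen, groups = set(), []
    for start in list(a_to_b):
        if start in seen:
            continue
        group_a, group_b, stack = set(), set(), [start]
        while stack:  # follow links back and forth until the group is complete
            a = stack.pop()
            if a in group_a:
                continue
            group_a.add(a)
            for b in a_to_b[a] - group_b:
                group_b.add(b)
                stack.extend(b_to_a[b] - group_a)
        seen |= group_a
        groups.append((sorted(group_a), sorted(group_b),
                       cardinality(len(group_a), len(group_b))))
    return groups
```

For example, the links `[("t5", "g5"), ("t5", "g6"), ("t6", "g6")]` form a single group labelled "n:m", which is the situation that demarcation of homologous entities is meant to avoid.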

3 Semantic Aspects of Road Networks 

In general people might observe road networks as a collection of line segments
and nodes (junctions at point locations). In small scale geographic data sets roads
are also represented as line features, but in large and mid-scale geographic data
sets, the roads and junctions are represented as area features. In this research
the GBKN (large scale base map, scale 1:1,000; see Fig. 2) and the TOP10vector
(mid-scale map, scale 1:10,000; see Fig. 1) are used. In both geographic data sets
roads are represented as area features. Although the roads are area features, it is
also useful to think about the linear network topology, because the road network
is a complex whole. It is not possible to look at one piece of the road network
and forget about the other parts. A single change could affect multiple related
road segments and road junctions.

As explained in the previous section, it is difficult to use complex road
polygons for update propagation. Therefore, the road network is split into multiple
elements. Before this can be done it is important to define these road elements
in an unambiguous manner. The road element definition of the Nationaal
Wegenbestand (NWB) [4] is used, which adheres to the European CEN standard
Geographic Data Files (GDF) [5]. The definitions cover road segments and road
junctions. A road segment is the reference unit on which users can put attribute
information; road junctions are nodes which connect the segments. The NWB
sees road segments as line features and road junctions as points. In the large
and mid-scale geographic data sets the road segments and road junctions are
area features; still, we try to adhere to the NWB as much as possible. The NWB
declares that roads have to be divided into segments and junctions if three or
more roads come together. Roads also have to be divided at places where the
street name changes, or the maintenance is done by a different organization, or
at the border of a village or a municipality.

This research focuses on the geometric aspect and the roads are just divided
at the junctions. The road network now consists of two types of area features:
road segments and road junctions. The NWB has detailed criteria for defining
the type of a junction. When two T-junctions are close together, they have to be
treated as one junction when one of the extended boundaries of a road lies
between the boundaries of the road on the other side of the junction; see Fig. 3.
Otherwise the result is two independent T-junctions; see Fig. 4.



Fig. 3. These 'T-junctions' result in one road junction area and four road
segments.

Fig. 4. These T-junctions result in two road junction areas and five road
segments.



It is not possible to use GBKN information directly to update the
TOP10vector. Generalization and aggregation play a role in converting GBKN
updates into TOP10vector updates; see Figs. 1 and 2. In the GBKN speed ramps
and small roundabouts are represented as small area features. The same is true
for sidewalks and parking strips. None of these objects are represented in the
TOP10vector. Updates in these small objects may not have any influence alone,
but several small updates together could create an update which might have
enough relevance to be propagated.

4 Geometric Aspects of Road Networks 

The method used to find the junctions in the road network is based on
triangulating the road area. This triangulation is used to compute a skeleton of the
road. The nodes in the road skeleton define the location of the junctions and
the edges of the surrounding triangle are used to separate the road network into
road segment areas and road junction areas. The constrained Delaunay
triangulation (CDT) algorithm described in [17] is used to compute the triangulation.
A CDT over a planar set of n vertices together with a set of non-intersecting
edges has the properties that all specified vertices and edges can also be found in
the output and the result is as close as possible to the unconstrained Delaunay
triangulation [8]. That is, the circumcircle of any triangle has no vertex inside
unless the vertex is at the other side of a constraining edge. In the case of the
road network the set of separate input vertices is empty and the set of edges
consists of the road boundary edges.
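The circumcircle property mentioned above is usually tested with a determinant (the in-circle predicate). The sketch below shows this standard test for the unconstrained case only; it does not model the exception for vertices that lie on the other side of a constraining edge, it assumes the triangle vertices are given in counter clockwise order, and the function names are hypothetical.

```python
def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc.

    The vertices a, b, c must be in counter clockwise order.
    """
    # Translate so that p becomes the origin.
    (ax, ay), (bx, by), (cx, cy) = [(q[0] - p[0], q[1] - p[1]) for q in (a, b, c)]
    determinant = ((ax * ax + ay * ay) * (bx * cy - cx * by)
                   - (bx * bx + by * by) * (ax * cy - cx * ay)
                   + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return determinant > 0


def violating_vertices(triangle, vertices):
    """Vertices (other than the triangle's own) inside the triangle's circumcircle."""
    a, b, c = triangle
    return [v for v in vertices if v not in triangle and in_circumcircle(a, b, c, v)]
```

In a constrained triangulation a vertex returned by `violating_vertices` is acceptable only if a constraining edge separates it from the triangle; that visibility test is deliberately left out of this sketch.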

4.1 Constrained Delaunay Triangulation 

The applied algorithm runs in O(n log n) time, which is asymptotically optimal
[10]. The algorithm is based on the concept of two other algorithms. The first
algorithm is the unconstrained Delaunay triangulation (UDT) algorithm of Lee
and Schachter [7] and the second algorithm is the CDT algorithm of Chew [1].
More details about the algorithm and its implementation can be found in [16].

In general the input of the CDT algorithm is a graph G = (V, E) in which
V is a set of n vertices (separate points and the end points of the input edges)
and E is a set of edges, the so called G-edges. Two different kinds of edges
appear in a CDT: G-edges, already present in the graph, and D-edges, created
by the CDT algorithm. If the graph has no G-edges then the CDT and the UDT
(unconstrained Delaunay triangulation) are the same. The applied algorithm is
based on the divide-and-conquer paradigm. The graph can be thought of as
contained in an enclosing rectangle (the domain). This rectangle is subdivided
into n separate vertical strips in such a way that each strip contains exactly
one region (a part of the strip) which in turn contains exactly one vertex. After
dividing the graph into n initial strips, adjacent strips are pasted together in
pairs to form new strips. During this pasting new regions are formed from existing
regions, for which the combined CDTs are calculated. This pasting of adjacent
strips is repeated following the divide-and-conquer paradigm until eventually
exactly one big strip, consisting of exactly one big region, is left for which the
CDT is calculated.
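Only the pasting schedule of this divide-and-conquer scheme is sketched below; the geometric work of combining two CDTs is not implemented and is passed in as a caller-supplied `merge` function. The function name `paste_strips` and the representation of a strip as a plain list of vertices are assumptions made for illustration.

```python
def paste_strips(vertices, merge):
    """Paste single-vertex strips together pairwise until one strip is left.

    `merge(left, right)` stands in for the step that combines the CDTs of two
    adjacent strips. Returns the final strip and the number of pasting rounds.
    """
    if not vertices:
        return [], 0
    # One initial strip per vertex, ordered from left to right.
    strips = [[v] for v in sorted(vertices)]
    rounds = 0
    while len(strips) > 1:
        pasted = []
        for i in range(0, len(strips), 2):
            if i + 1 < len(strips):
                pasted.append(merge(strips[i], strips[i + 1]))
            else:
                pasted.append(strips[i])  # odd strip out waits for the next round
        strips = pasted
        rounds += 1
    return strips[0], rounds
```

With n initial strips the number of rounds grows with log n; if each round does work proportional to n, this is consistent with the O(n log n) bound quoted for the algorithm.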

4.2 Interpreting Triangles of the Road Network 

The method used to derive the skeleton from the CDT is based on Wilschut
et al. [22]. In the triangulation four different types of triangles cover the road
area, distinguished by the number of G-edges in the boundary of the triangle:
0-triangle, 1-triangle, 2-triangle, and 3-triangle. The 3-triangle is an exception and
only occurs when there is a non-connected road area with triangular shape in the
input data set. A junction can be found by a triangle which has only D-edges
and no G-edges, that is a 0-triangle; see the light triangles in Figs. 5 and 6. A
T-junction is a single 0-triangle with no neighbor 0-triangles. A normal crossing
(4-way junction) is defined by 2 adjacent 0-triangles; e.g. the junctions at the
bottom center in Fig. 6. In general an n-way junction is defined by (n-2) adjacent
0-triangles.
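The classification of triangles and the grouping of adjacent 0-triangles into junctions can be sketched as follows. The sketch works on topology only: triangles are triples of vertex identifiers, G-edges are unordered vertex pairs, and the names `edge`, `classify` and `junctions` are hypothetical.

```python
def edge(u, v):
    """Unordered edge between two vertex identifiers."""
    return frozenset((u, v))


def classify(triangle, g_edges):
    """Number of G-edges among the three sides: 0, 1, 2 or 3."""
    a, b, c = triangle
    return sum(edge(u, v) in g_edges for u, v in ((a, b), (b, c), (c, a)))


def junctions(triangles, g_edges):
    """Group edge-adjacent 0-triangles; k grouped triangles form a (k + 2)-way junction."""
    zero = [t for t in triangles if classify(t, g_edges) == 0]
    unvisited = set(range(len(zero)))
    result = []
    while unvisited:
        stack, group = [unvisited.pop()], []
        while stack:
            i = stack.pop()
            group.append(zero[i])
            # Two triangles are adjacent when they share exactly two vertices.
            near = [j for j in unvisited if len(set(zero[i]) & set(zero[j])) == 2]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
        result.append((len(group) + 2, group))
    return result
```

A single isolated 0-triangle therefore yields a 3-way (T-) junction and two adjacent 0-triangles yield a 4-way junction, in line with the (n-2) rule above.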

The 1-triangles form the building blocks of connecting road segments. Finally,
the 2-triangles define the end points of the road network, that is, the dead-end
streets. Also, a small lump in the boundary of a road segment may result in a
2-triangle and therefore in a small dead-end street; see the top right road segment
in Fig. 6. A solution for this 'problem' is to (virtually) remove any 2-triangle
smaller than a certain threshold area that is adjacent to a 0-triangle. In such a case,
the original 0-triangle also becomes a (virtual) 1-triangle, that is, part of a connecting
road segment and not a junction. Further, due to the fine distribution of vertices
on the road boundary, the two 0-triangles defining a 4-way junction may not be
topologically adjacent. In this situation another approach is needed to couple
the two 0-triangles for one 4-way junction: if the distance between the center
points of the two 0-triangles is less than a certain threshold, then the 0-triangles
can be coupled. Note that this is also true for n-way junctions in general.

H. Uitermark, A. Vogels, and P. van Oosterom

Fig. 5. Triangulated road network

Fig. 6. Another road network

The separation (demarcation) between road segments and road junctions is
defined by the edges of 0-triangles, leaving out the shared edges of 0-triangles
in case of an n-way junction with n > 3. A last point of attention is the location of
the junction (point) and the separation edges. In case of the T-junction, there
may not be a close vertex at the road boundary at the other side of the road;
see Fig. 7: in the top horizontal road, the junction is about 30 meters to the east
of the actual location of the junction. This may be solved by adding additional
(intermediate) vertices within long road boundary edges, whenever they are
longer than a certain maximum length (e.g. the average width of a road, about 15
meters); see Fig. 8. Note that adding additional vertices will increase the
computing resources (memory and time) needed during the triangulation and subsequent
processes. First applying line generalization may reduce the required computing
resources during the triangulation. It also has the advantages that it removes
some virtual 2-triangles and that close, but not directly neighboring, 0-triangles may
become direct neighbor 0-triangles (beneficial for finding 4- and higher way road
junctions); see Fig. 9.

Fig. 7. Road junction with displaced 0-triangle (center node).

Fig. 8. Add more intermediate points: better 0-triangles.
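The insertion of intermediate vertices in long road boundary edges, as described above, can be sketched as follows. The function name `densify` and the choice of evenly spaced vertices are assumptions; the text only requires that no edge remains longer than the maximum length.

```python
import math


def densify(boundary, max_length):
    """Insert evenly spaced vertices so that no edge is longer than max_length."""
    if not boundary:
        return []
    result = [boundary[0]]
    for (x1, y1), (x2, y2) in zip(boundary, boundary[1:]):
        length = math.hypot(x2 - x1, y2 - y1)
        pieces = max(1, math.ceil(length / max_length))
        for k in range(1, pieces):
            result.append((x1 + (x2 - x1) * k / pieces,
                           y1 + (y2 - y1) * k / pieces))
        result.append((x2, y2))  # keep the original end vertex exactly
    return result
```

With a maximum length of 15 (meters), a 45 meter boundary edge receives two intermediate vertices, while a 10 meter edge is left unchanged. The number of vertices, and with it the cost of the triangulation, grows accordingly.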



4.3 Road Center Lines 

Once the 0-, 1-, and 2-triangles are obtained it is not only possible to find the 
road junctions and dead-end roads, but it is also relatively easy to find a skeleton 
of the road. That is, the corresponding linear network based on the road center 
lines. The construction of the skeleton is based on following the middle of the 
internal edges of the 1-triangles. In a 0-triangle the center of the triangle is 
connected to the middle of all three edges. In a 2-triangle the middle of the 
D-edge is connected to the common point between the two G-edges, that is, the 
end-point of the road. Finally, this method needs some post-processing in order
to remove the ’dip’ in a straight road at a T-junction and also to merge the two
connected center points of adjacent T-junctions into one center point of a 4-way junction.
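
The per-triangle construction rules can be sketched as below. The sketch rests on assumptions: a k-triangle is read as a triangle with k road boundary edges, G-edges are read as boundary edges and the D-edge as the internal edge, and the function names and the edge numbering are hypothetical. The post-processing step is not covered.

```python
def midpoint(p, q):
    """Midpoint of the segment between two (x, y) points."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def skeleton_segments(tri, is_boundary):
    """Center-line pieces for one triangle.

    tri: three (x, y) vertices; edge i joins tri[i] and tri[(i + 1) % 3].
    is_boundary[i]: True when edge i lies on the road boundary.
    """
    internal = [i for i in range(3) if not is_boundary[i]]
    mids = {i: midpoint(tri[i], tri[(i + 1) % 3]) for i in internal}
    if len(internal) == 3:
        # 0-triangle: connect the center to the middle of all three edges.
        center = (sum(p[0] for p in tri) / 3.0, sum(p[1] for p in tri) / 3.0)
        return [(center, mids[i]) for i in internal]
    if len(internal) == 2:
        # 1-triangle: connect the middles of the two internal edges.
        return [(mids[internal[0]], mids[internal[1]])]
    if len(internal) == 1:
        # 2-triangle: connect the middle of the internal edge to the vertex
        # shared by the two boundary edges (the end-point of the road).
        i = internal[0]
        return [(mids[i], tri[(i + 2) % 3])]
    return []  # all three edges on the boundary: nothing to connect
```

Chaining the returned pieces over neighboring triangles yields the raw skeleton.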

A general method for solving these problems is described in detail by Gao 
and Minami [3]. Their method is based on looking at the trend-lines of the parts 
of the center lines within a certain radius around the node (or average of a group 
of nodes within radius distance of each other). A pair of trend-lines with nearly
the same angle is replaced by a straight line connecting the two corresponding 
center lines. In case there are more pairs of trend-lines with nearly the same 
angle, then the intersection(s) of the corresponding straight line connections is 
(are) computed. The other center lines are connected through their trend-line to 
the straight line. The location of the junction node is the (average) intersection 
point, to which all center lines are connected. 

The method, described in this paper, to obtain center lines is a vector based 
approach. It assumes a topologically correct input of the road boundaries. In case 
the vector data is inaccurate, a raster based approach may be more appropriate. 
A description of this method is given by Thomas [12], in which the raster is 
represented by a compact run-length encoded binary image. 
Fig. 9. Apply line generalization: fewer fake 0-triangles



5 Update Propagation 



As explained before, road update propagation is different from building update
propagation. The most important difference is that buildings are usually
unattached to other objects, whereas road segments are connected to road junctions
and other road segments; the network remains a complex whole even after the road
polygons are divided into road segments and road junctions. To reduce the complexity and
find the 1:1 (or 1:n or n:1) correspondences needed to propagate the relevant
updates, we propose the following six steps:

- Step 1: Synchronization of GBKN and TOP10vector to the same moment in
time. In general, updates in the field are measured every 6 to 12 months and
used to update the GBKN. The TOP10vector, on the other hand, is updated
every four years. Before updates can be propagated to the other database,
the two databases have to be synchronized. Synchronization means rolling
back the GBKN in time until its date is the same as the TOP10vector date.
This is possible because every object in the GBKN database has two time
stamps: Tmin and Tmax [19]. Tmin is the date an object was added
to the database. Tmax is the date an object was replaced by one or
more other objects. These ’old’ objects remain in the database, but are no longer
’valid’. If the database is rolled back in time, the old objects become valid again
and represent the desired moment in the past.

- Step 2: Create road segment and road junction areas in the GBKN and the
TOP10vector. This step is based on the constrained Delaunay triangulation
and is explained in the previous section.

- Step 3: Find corresponding road segments and road junction areas between
GBKN and TOP10vector. The correspondences between road elements
are found by computing the overlap between the elements from both the
GBKN and TOP10vector databases; the method is explained in [20] and
is similar to finding correspondences between building updates.
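
The rollback described in Step 1 amounts to a temporal filter on the two time stamps. The following is a minimal sketch, not the GBKN implementation: the class `GbknObject`, the function `valid_at`, the use of `None` for an object that has not been replaced, and the convention that Tmin is inclusive and Tmax exclusive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class GbknObject:
    object_id: str
    tmin: date                    # date the object was added
    tmax: Optional[date] = None   # date it was replaced; None if still current


def valid_at(objects: List[GbknObject], moment: date) -> List[GbknObject]:
    """Return the objects that were valid at the given moment in time."""
    return [o for o in objects
            if o.tmin <= moment and (o.tmax is None or moment < o.tmax)]
```

Calling `valid_at` with the TOP10vector date would return the GBKN state to be compared in the following steps.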




Fig. 10. Overview of the update propagation steps
Fig. 11. Changes in the road network.



Before it is possible to propagate updates, line feature updates (resulting from
a change in the road boundary edges) have to be transformed into area feature
updates. This has to be done because surveyors in the field always collect
(point and) line measurements. Attention has to be paid to which lines have
been used for creating road objects and which areas are affected by deleted, changed
or new lines. We also have to answer the question of when and how
GBKN road updates should and could be propagated into the TOP10vector.
- Step 4: Decide if an update is important enough to be propagated to the
TOP10vector. Determine whether the update affects the TOP10vector objects while
respecting generalization and aggregation rules and the relevance of an object
for the TOP10vector. For example, in Fig. 11 the new road element G12 is
relevant, but the new road element G11 is not relevant because it is too
small.

- Step 5: Transform GBKN updates into TOP10vector updates. The object
definitions are not the same. For example, in the TOP10vector a ditch belongs
to the road, but this is not the case in the GBKN. Before transforming an
update it is helpful to analyze the relationship between the objects in both
geographic databases. We analyze correspondences at the geometric, attribute
and semantic levels to understand if and how database integration could be
possible. This makes it possible to find a mechanism to change GBKN updates
into TOP10vector updates. For example, in Fig. 11 the GBKN road element
G12 is first generalized and its classification is adjusted, and it then becomes
TOP10vector road element T12.

- Step 6: Propagate the updates into the TOP10vector. An update has changed
an original object. If that update has to be propagated to the TOP10vector, it
has to be propagated to the TOP10vector object that corresponds to the
GBKN object that has been updated. It is not possible simply to propagate
the new object and remove the old version, as is done when propagating building
updates. If there is a new road, it is also necessary to connect the new road to the
existing road segments, road junctions and other nearby areas. That is,
it has to be fitted into the TOP10vector topology structure. For example,
in Fig. 11 the GBKN road element G12 is connected to road elements G5
and G7; in the same manner, T12 has to be connected to T5 and T7 in the
TOP10vector.

Fig. 11 shows an example area with changes in the road network. In this figure 
only the roads are shown. One old road disappears, one new road appears. 



6 Conclusion and Future Work 

In this paper the importance of well-defined road network elements is argued
for the purpose of geographic database integration in general and update
propagation in particular. It was shown that the constrained Delaunay triangulation
gives a good basis for the demarcation of road segments and road junctions.
However, a few refinements are required, such as removing small lumps (or very
small dead-end streets) and grouping neighboring or very close T-junctions into one
n-way junction.

Future work consists of experimenting with the update propagation steps
described in the previous section using real data sets. Further, an investigation into
other ’linear’ feature types, such as railroads and waterways, is planned. The
question is whether the same method for defining objects is valid for these feature
types. Finally, we have to consider update propagation, and geographic database
integration in general, not only on a ’feature by feature’ basis, but in a more
integrated way. Very often the different feature types are embedded in the same
planar topology structure and heavily influence each other.
Abstract. The assessment of semantic similarity among objects is a
basic requirement for semantic interoperability. This paper presents an
innovative approach to semantic similarity assessment by combining the
advantages of two different strategies: a feature-matching process and a
semantic distance calculation. The model involves a knowledge base of spatial
concepts that consists of semantic relations (is-a and part-whole) and
distinguishing features (functions, parts, and attributes). By taking into
consideration cognitive properties of similarity assessments, this model
represents a cognitively plausible and computationally achievable method
for measuring the degree of interoperability.



1 Introduction 

Since the first studies on interoperability, progress has been made concerning 
syntactic interoperability, i.e., data types and formats, and structural interope- 
rability, i.e., schematic integration, query languages, and interfaces (Sheth 1998). 
As current information systems increasingly confront information and knowledge 
issues, semantic interoperability becomes a major challenge for the next genera- 
tion of interoperating information systems. 

In information systems, semantics relates the content and representation of 
information to the entities or concepts in the world (Meersman 1997). The pro- 
blem of semantic interoperability is the identification of semantically similar 
objects that belong to different databases and the resolution of their schematic 
differences (Kashyap and Sheth 1996). Schematic heterogeneity can only exist, 
and therefore be solved, for semantically similar objects (Bishr 1997). Studies 
have suggested the use of an ontology (Guarino and Giaretta 1995) as a frame- 
work for semantic similarity detection (Bishr 1997, Kashyap and Sheth 1998). 
On the one hand, a possible approach is to create a knowledge base in terms of 
a common ontology, upon which it is possible to detect semantic similarities and 
to define a mapping process between concepts (Lenat and Guha 1990, Kahng
and McLeod 1998). On the other hand, we can expect that in a realistic scenario 
new concepts will be added to or eliminated from the ontology. There may be 
different ways to classify a concept based on the specific application and the 
degree of detail of the concept’s definition. Hence, the reuse and integration of 
existing domain specific ontologies becomes necessary (Kashyap and Sheth 1998, 
Mena et al. 1998). 

This paper presents a computational model for similarity assessment among 
entity classes. We use the term entity classes to describe concepts in the real 
world and to distinguish their semantics from the semantics of data modeled and 
represented in a database. By concepts in the real world, we mean the cognitive 
representation which a person uses to recognize and categorize objects or events 
(Dahlgren 1988). Naturally, achieving semantic representation of objects in a 
database implies a good understanding of the semantics of the corresponding 
concepts in the real world. Consequently, our work considers studies done by 
cognitive scientists in the area of knowledge and behavior as well as by computer 
scientists in the domain of artificial intelligence. 

The similarity model assumes a common ontology that includes the real world 
concepts’ distinguishing features and interrelationships. A feature-matching pro- 
cess, together with a semantic distance computation, provides a strategy to 
create a model that satisfies cognitive properties of similarity assessment. In 
particular, we capture the idea that similarity assessment is not always a sym- 
metric evaluation: similarity is a result of the commonalities and differences 
between two concepts, and the relevance of the distinguishing features (func- 
tions, parts, and attributes) may differ from one to another. In addition, is-a 
relations are complemented with part-whole relations to create an ontology that 
better reflects the interrelationships between concepts. 

We focus on the domain of spatial information and we combine two existing 
sources of information, WordNet (Miller 1995) and the Spatial Data Transfer 
Standard (USGS 1998), to create a common ontology that is used for the de- 
velopment of a prototype. The scope of this study includes only the evaluation 
of similarity within this common ontology. The analysis of how to integrate two 
domain specific ontologies is left for future work. 

The remainder of the paper is organized as follows. Section 2 reviews dif- 
ferent approaches to the evaluation of semantic similarity. Section 3 describes 
the components of the definition of entity classes. In Section 4 we present our 
similarity model, and we illustrate its use with an example in Section 5. Finally, 
conclusions and future work are presented in Section 6. 



2 Methods for Comparing Semantics 

Most of the models proposed by psychologists are feature-based approaches, 
which use features that characterize entities or concepts (for example, properties 
and role). Using set theory, Tversky (1977) defined a similarity measure as a
feature-matching process. It produces a similarity value that is not only the 




Assessing Semantic Similarities among Geospatial Feature Class Definitions 






result of common features, but also the result of the differences between two 
entity classes. A different strategy for feature-based models is to determine a 
semantic distance between concepts as their Euclidean distance in a semantic, 
multidimensional space (Rips et al. 1973). This approach describes similarity by a 
monotonic function of the interpoint distance within a multidimensional space, 
where the axes in this space describe features of concepts. Krumhansl (1978) 
introduced the distance-density model based on a distance function for similarity 
assessment that complements the interpoint distance with the density of the 
space. This model assumes that, within dense regions of a given stimulus range, 
finer discriminations are made than within relatively less dense subregions. 

A shared disadvantage of feature-based models is that two entities are seen 
to be similar if they have common features; however, it may be argued that the 
extent to which a concept possesses or is associated with a feature may be a 
matter of degree (Krumhansl 1978). Consequently, a specific feature can be
more important to the meaning of an entity class than another. On the other 
hand, the consideration of common features between entity classes seems to 
match the way people assess similarity. 

With a different approach, computer scientists have defined similarity measu- 
res whose basic strategies make use of the semantic relations between concepts. 
These semantic relations are typically organized in a semantic network (Collins
and Quillian 1969), in which nodes denote concepts and the links between them denote relations.
The semantic distance results in an intuitive and direct way of evaluating si- 
milarity in a hierarchical semantic network. For a semantic network with only 
is-a relations, Rada et al. (1989) pointed out that the semantic relatedness and 
semantic distance are equivalent and we can use the latter as a measure of the 
former. They defined conceptual distance as the length of the shortest path bet- 
ween two nodes in the semantic network. This distance function satisfies metric 
properties of minimality, symmetry, and triangle inequality. 
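
A shortest-path distance of this kind can be sketched with a breadth-first search. The sketch assumes an unweighted network of is-a links that may be traversed in both directions; the function name `conceptual_distance` and the toy hierarchy in the usage note are hypothetical.

```python
from collections import deque


def conceptual_distance(is_a, a, b):
    """Length of the shortest path between a and b, or None if unconnected.

    is_a maps each class to a list of its direct superclasses.
    """
    graph = {}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

For a toy hierarchy in which hospital and apartment building are both buildings, and building and bridge are both structures, the distance from hospital to apartment building is 2 and from hospital to bridge is 3, in either direction.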

Although the semantic distance models have been supported by a number of 
experiments and have been shown to be well suited for a specific domain, they have
the disadvantage of being highly sensitive to the predefined semantic-network ar- 
chitecture. In a realistic scenario, adjacent nodes are not necessarily equidistant. 
Irregular density often results in unexpected conceptual distance measures. Most 
concepts in the middle to high sections of the hierarchical network, being spati- 
ally close to each other, would therefore be deemed to be conceptually similar to 
each other. In order to account for the underlying architecture of the semantic 
network, Lee et al. (1993) argued that the semantic distance model should allow 
for weighted indexing schema and variable edge weights. To determine weights,
the structural characteristics of the semantic network are typically considered,
such as the local network density, the depth of a node in a hierarchy, the type
of link, and the strength of an edge link. 

Some studies have considered weighted distance in a semantic network. Ri- 
chardson and Smeaton (1996) used a hierarchical concept graph (HCG) derived 
from WordNet (Miller 1995) to determine similarity. They defined weights of 
links in a semantic network by the density of the HCG, estimated as the number 
of links, and by the link strength, estimated as a function of a node’s information
content. Likewise, Jiang and Conrath (1997) proposed the use of information
content to determine the link strength of an edge. The information content of a 
node is obtained from the statistical analysis of word frequency occurrences in 
a corpus. The general idea of the information content is that, as the probability 
of occurrence of a concept in a corpus increases, informativeness decreases, such 
that the more abstract a concept, the lower its information content. 
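
The text does not give a formula; a common formalization, assumed here, is the negative logarithm of a concept's probability, where an occurrence of a concept also counts as an occurrence of each of its superclasses. The sketch below further assumes a single-inheritance hierarchy without cycles; the function name and the counts in the usage note are hypothetical.

```python
import math


def information_content(counts, is_a):
    """Return -log p(c) for every concept c with a non-zero cumulative count.

    counts: concept -> number of occurrences in the corpus.
    is_a: concept -> its single direct superclass (tree-shaped hierarchy).
    """
    totals = dict.fromkeys(set(counts) | set(is_a) | set(is_a.values()), 0)
    for concept, n in counts.items():
        node = concept
        while node is not None:
            totals[node] += n      # propagate the count to every ancestor
            node = is_a.get(node)
    grand_total = sum(counts.values())
    return {c: -math.log(t / grand_total)
            for c, t in totals.items() if t > 0}
```

In a toy corpus where the root concept subsumes every occurrence, the root has probability 1 and an information content of 0, while more specific concepts score higher.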

Richardson and Smeaton (1996) and Richardson et al. (1994) used a hierar- 
chical network and information theory to propose an information-based model 
of similarity. Their approach to modeling semantic similarity makes use of the 
information content as described above, but it does not include distance as a 
basic strategy for similarity assessment. Conceptual similarity is considered in 
terms of class similarity. The similarity between two classes is approximated by 
the information content of the first superclass in the hierarchy that subsumes 
both classes. In the case of multiple inheritance (Cardelli 1984), similarity can 
be determined by the best similarity value among all various senses the clas- 
ses belong to. The information-content model requires less information on the 
detailed structure of the network. On the other hand, many polysemous words 
and multi-worded classes will have an exaggerated information content value. 
The information-content model can generate a coarse result for the comparison 
of concepts, because it does not differentiate the similarity values of any pair of 
concepts in a sub-hierarchy as long as their “smallest common denominator” is
the same (Jiang and Conrath 1997). 

In the cognitive-linguistics domain, Miller and Charles (1991) discussed a
contextual approach to semantic similarity. They developed a measure for si- 
milarity that is defined in terms of the degree of substitutability of words in 
sentences. For words from the same syntactic category and the same domain, 
the more often it is possible to substitute one word by another within the same 
context, the more similar the words are. The problem with this similarity mea- 
sure is that it is difficult to define a systematic way to calculate it. 

Based on our analysis of current models for semantic similarity, we propose 
a combination of the feature-matching process and the evaluation of semantic
distance. We expect that this integrated model will provide a similarity measure
that is not only cognitively plausible, but also computationally achievable. 

3 Components of Entity Class Definitions 

Important components of the entity class definitions are the semantic relations 
among classes. We select a specific domain, spatial information systems, and 
describe the set of entity classes and their semantic relations as an ontology. In 
artificial intelligence, the term ontology has been used in different ways. Ontology 
has been defined as a “specification of a conceptualization” (Gruber 1995) and as 
a “logical theory which gives an explicit, partial account of a conceptualization” 
(Guarino and Giaretta 1995). Thus, an ontology is a kind of knowledge base 
that has an underlying conceptualization. For our purpose, an ontology will 
be used as a body of knowledge that defines (1) primitive symbols used in the
representation of meaning and (2) a rich system of semantic relations connecting 
those symbols. 

The most common semantic relation used in an ontology is the is-a relation, 
also called the hypernymic or superordinate relation. This relation goes from a
specific to a more general concept and resembles the generalization mechanism
of object-oriented theory (Dittrich 1986). The is-a relation is a transitive and 
asymmetric relation that defines a hierarchical structure, where terms inherit all 
the characteristics of their superordinate terms. 
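
The inheritance along is-a links can be illustrated with a short sketch. It assumes single inheritance and represents characteristics as plain strings; the function name `all_features` and the example classes are hypothetical.

```python
def all_features(entity_class, is_a, own_features):
    """Collect the features of a class and of all its superordinate classes.

    is_a: class -> its direct superclass.
    own_features: class -> set of features declared on that class itself.
    """
    features = set()
    visited = set()
    node = entity_class
    while node is not None and node not in visited:
        visited.add(node)  # guards against accidental cycles in the input
        features |= own_features.get(node, set())
        node = is_a.get(node)
    return features
```

Because the relation is transitive, a hospital defined as a building, which in turn is a structure, ends up with the features of all three classes.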

Mereology, the study of part-whole relations (Guarino 1995), plays another
important role for ontology. Studies have usually assumed that part-whole
relations are transitive such that if a is part of b and b is part of c, then a is part of c as
well. Linguists, however, have expressed concerns about this assumption (Cruse 
1979, Iris et al. 1988). Explanations of the transitive problem rely on the idea 
that part-whole relations are not one type of relation, but a family of relations. 
Winston et al. (1987) defined six types of part-whole relations: component-object 
(e.g., pedal-bike), member-collection (e.g., tree-forest), portion-mass (e.g., slice-cake),
stuff-object (e.g., steel-bicycle), feature-activity (e.g., paying-shopping),
and place-area (e.g., oasis-desert). Chaffin and Herrmann (1988) extended the 
previous classification with a seventh meronymic relation, phase-process (e.g., 
adolescence-growing up). For this work, we only consider the component-object 
relation with the properties of asymmetry and (with some reservations) transi- 
tivity. 

When defining entity classes, the part-whole converse relations do not always 
hold. For example, we can say that a building complex has buildings, i.e., building 
complex is the whole for a set of buildings; however, buildings are not always 
part of a building complex. Thus, we distinguish the two relations, “part-of” and
“whole-of,” to be able to account for such cases. 

Although the general organization of the entity classes is given by their se- 
mantic relations, this information is not enough to distinguish one class from 
another. For example, a hospital and an apartment building have a common 
superclass building; however, this information is insufficient to differentiate a 
hospital from an apartment building. Considering that entity classes correspond 
to nouns in linguistic terms, we borrow Miller’s (1990) description of nouns and 
propose to assign what he called distinguishing features to each class. Distin- 
guishing features include parts, functions, and attributes. 

Parts are structural elements of a class, such as roof and floor of a building. 
We could make a further distinction between “things” that a class must have 
(“mandatory”) or can have (“optional”). Note that parts are related to the 
relation part-whole previously discussed. While the relation part-whole works at 
the level of entity-class definitions and forces us to define all the entity classes 
involved, part features can have items that are not always defined as entity 
classes in our model. Function features are intended to represent what is done to 
or with a class. For example, the function of a college is to educate. Thus, function 
features can be related to other terms such as affordances (Gibson 1979) and 
behavior (Khoshafian and Abnous 1990). Attributes correspond to additional
characteristics of a class that are not considered by either the set of parts or the 
set of functions. For example, some of the attributes of a building are age, user 
type, owner type, and architectural properties. Using a lexical categorization, 
parts are given by nouns, functions by verbs, and attributes by nouns whose 
associated values are given by adjectives or other nouns. 

In addition to semantic relations and distinguishing features, two more lin- 
guistic concepts are taken into consideration for the definition of entity classes. 
Entity classes are associated with concepts represented in natural language by 
words. Natural language understanding distinguishes two properties of the map- 
ping between words and meanings, polysemy and synonymy. Polysemy arises 
when the same word may have more than one meaning, different senses. Syno- 
nymy corresponds to the case where two different words have the same meaning 
(Miller et al. 1990). Our entity-class definition incorporates synonyms, such as 
parking lot and parking area, and different senses of entity classes, such as the 
case when a bank could be an elevation of the seafloor, a sloping margin of a
river, an institution, or a building. 



4 A Computational Method for Assessing Similarities of 
Entity Classes 

We introduce a computational model that assesses similarity by combining a 
feature-matching process with a semantic distance measurement. While our mo- 
del uses the number of common and different features between two entity classes, 
it defines the relevance of the different features in terms of distance in a semantic 
network. 

For each type of distinguishing features (i.e., parts, functions, and attributes)
we propose to use a similarity function $S_t(c_1, c_2)$ (Equation 1) that is based on
the ratio model of a feature-matching process (Tversky 1977). In $S_t(c_1, c_2)$, $c_1$
and $c_2$ are two entity classes, $t$ symbolizes the type of features, and $C_1$ and
$C_2$ are the respective sets of features of type $t$ for $c_1$ and $c_2$. The matching
process determines the cardinality (#) of the set intersection ($C_1 \cap C_2$) and the
set difference ($C_1 - C_2$), defined as the set of all elements that belong to $C_1$ but
not to $C_2$.



$$S_t(c_1, c_2) = \frac{\#(C_1 \cap C_2)}{\#(C_1 \cap C_2) + \alpha\,\#(C_1 - C_2) + (1 - \alpha)\,\#(C_2 - C_1)} \qquad (1)$$
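
The ratio model can be sketched directly with set operations. This is a minimal illustration, not the prototype's code: features are plain strings, the weight is passed in as a number between 0 and 1, the name `ratio_similarity` is hypothetical, and raising an error when the denominator is zero is a choice of this sketch, since the formula is undefined in that case.

```python
def ratio_similarity(features_1, features_2, alpha):
    """Similarity of two feature sets under the ratio model.

    alpha weights the features found only in the first set;
    (1 - alpha) weights the features found only in the second set.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    f1, f2 = set(features_1), set(features_2)
    common = len(f1 & f2)
    only_1 = len(f1 - f2)
    only_2 = len(f2 - f1)
    denominator = common + alpha * only_1 + (1.0 - alpha) * only_2
    if denominator == 0:
        raise ValueError("similarity is undefined for these inputs")
    return common / denominator
```

With a weight of 0, a set {roof, floor, emergency room} compared against {roof, floor} scores 1, whereas the reverse comparison scores 2/3, which shows how the weight makes the measure direction-dependent.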



This similarity function yields values between 0 and 1. The extreme value 1 
represents the case when everything is common between two entity classes, or 
when the non-common features between two entity classes do not affect the 
similarity value (i.e., the coefficient of the non-common features is zero). The 
value 0, in contrast, occurs when everything is different between two entity
classes. The weight $\alpha$ is determined as a function of the distance between the
entity classes ($c_1$ and $c_2$) and the immediate superclass that subsumes both
classes. This corresponds to the least upper bound (l.u.b.) between two entity
classes in partially ordered sets (Birkhoff 1967). When one of the concepts is the
superclass of the other, the former is also considered the immediate superclass
(l.u.b.) between them. The distance of each entity class to the l.u.b. is normalized
by the total distance between the two classes, such that we obtain values in the
range of 0 to 1. Then, to obtain the final values of $\alpha$, we define an asymmetric
function (Equation 2).



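$$\alpha(c_1, c_2) = \begin{cases} \dfrac{d(c_1, \mathrm{l.u.b.})}{d(c_1, c_2)} & \text{if } d(c_1, \mathrm{l.u.b.}) \le d(c_2, \mathrm{l.u.b.}) \\[2ex] 1 - \dfrac{d(c_1, \mathrm{l.u.b.})}{d(c_1, c_2)} & \text{if } d(c_1, \mathrm{l.u.b.}) > d(c_2, \mathrm{l.u.b.}) \end{cases} \qquad (2)$$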
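
One reading of Equation 2 that agrees with the surrounding text is sketched below: the distance from the first class to the l.u.b. is divided by the total distance between the two classes, and the complement is taken when the first class lies farther from the l.u.b. than the second. The sketch assumes that the total distance is the sum of the two distances to the l.u.b., that ties fall into the first case, and that identical classes receive the value 0; the name `alpha_weight` is hypothetical.

```python
def alpha_weight(d1_lub, d2_lub):
    """Weight for the non-common features of the first argument.

    d1_lub, d2_lub: distances of the first and second class to their
    least upper bound (l.u.b.) in the hierarchy.
    """
    total = d1_lub + d2_lub  # assumed: the path between the classes passes the l.u.b.
    if total == 0:
        return 0.0  # same class: there are no non-common features to weight
    ratio = d1_lub / total
    return ratio if d1_lub <= d2_lub else 1.0 - ratio
```

Under this reading the weight is 0 whenever one class is the superclass of the other and 0.5 when both classes are equally far from the l.u.b.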



The assumption behind the determination of $\alpha$ is that similarity is not necessarily
a symmetric relation (Tversky 1977). For example, “a hospital is similar
to a building” is a more general agreement than “a building is similar to a hos- 
pital.” It has been suggested that the perceived distance from the prototype to 
the variant is greater than the perceived distance from the variant to the pro- 
totype, and that the prototype is commonly used as a second argument of the 
evaluation of similarity (Rosch and Mervis 1975, Krumhansl 1978). Hence, we 
assume that the non-common features of the concept used as a reference (the 
second argument) should be more relevant in the evaluation. 

An interesting case occurs when comparing a class with its superclass or vice 
versa. Since subclasses inherit all features of their superclasses, only subclasses 
may have non-common features. It can be easily seen that when comparing 
a class with its superclass or vice versa, the weight associated with the non-common
features of the first argument is 0 ($\alpha$) and the weight for the non-common
features of the second argument is 1 ($1 - \alpha$). By considering the direction
of the similarity evaluation, a class will be more similar to its superclass than
the same superclass to the class. Currently, and for the purpose of calculating
the weight $\alpha$, the part-of relation is treated like the is-a relation. The difference
between these two relations lies in the inheritance property of the is-a relation.
The effect of the part-of relation can be illustrated when comparing a building 
with a building complex or vice versa. With our model, a stronger similarity is 
found between the building and the building complex than between the building 
complex and the building. Note, however, that the similarity between the whole 
and its parts could also be higher, since there is no inheritance property for
this semantic relation that forces the parts to have all the features of the whole.

When searching for an entity class, synonyms are incorporated at the be- 
ginning of an evaluation of similarity. In addition, synonyms are also taken into 
account in the matching process of parts, functions, and attributes. Each term 
(entity class, part, function, or attribute) is treated in the same way as its sy- 
nonyms. Words with different semantics or senses (polysemy) are also included. 
We handle different senses of entity class as independent entity classes with a 
common name. For parts, functions, and attributes, we first match the senses of 
the terms and then we evaluate the set-intersection or set-difference operation 




M.A. Rodriguez, M.J. Egenhofer, and R.D. Rugg



among the set of features. A term in one sense might have a set of synonyms;
therefore, we match terms or their synonyms that belong to the same sense. For
example, the verb “to play” has two different senses in our database, play for 
recreation and play for competition. For any entity class that has the function 
“to play,” the knowledge base also includes the sense of the word such that the 
system can find the synonyms of “to play” for the respective sense. 
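
Sense-aware matching of this kind can be sketched as follows. The sketch assumes that each feature is stored as a (word, sense identifier) pair and that the knowledge base maps every sense identifier to its set of synonymous words; the function names, the sense identifiers, and the synonyms "recreate" and "compete" are hypothetical.

```python
def normalize(features, synsets):
    """Map each (word, sense) feature to its sense identifier.

    A feature is kept only if the word is listed among the synonyms
    of the sense it claims to have.
    """
    return {sense for word, sense in features
            if word in synsets.get(sense, ())}


def common_features(features_1, features_2, synsets):
    """Features shared by two classes, matched by sense rather than by word."""
    return normalize(features_1, synsets) & normalize(features_2, synsets)
```

Two classes whose functions are written with different words of the same sense then match, while two classes that both list "play" but in different senses do not.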

The global similarity function $S(c_1, c_2)$ is a weighted sum of the similarity
values for parts, functions, and attributes (Equation 3), where $\omega_p$, $\omega_f$, and $\omega_a$ are
weights of the similarity values for parts, functions, and attributes, respectively.
These weights define the importance of parts, functions, and attributes, which
might vary among different contexts. The weights must add up to 1.

$$S(c_1, c_2) = \omega_p \cdot S_p(c_1, c_2) + \omega_f \cdot S_f(c_1, c_2) + \omega_a \cdot S_a(c_1, c_2) \qquad (3)$$
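
The weighted sum can be sketched in a few lines. The dictionary keys, the function name `global_similarity`, and the numbers in the usage note are assumptions made for illustration only.

```python
FEATURE_TYPES = ("parts", "functions", "attributes")


def global_similarity(similarities, weights):
    """Weighted sum of the per-type similarity values.

    similarities, weights: dicts keyed by "parts", "functions", "attributes".
    """
    if abs(sum(weights[t] for t in FEATURE_TYPES) - 1.0) > 1e-9:
        raise ValueError("the weights must add up to 1")
    return sum(weights[t] * similarities[t] for t in FEATURE_TYPES)
```

For example, per-type similarities of 1.0, 0.5 and 0.0 for parts, functions and attributes, combined with weights of 0.25, 0.5 and 0.25, give a global similarity of 0.5.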

5 An Example 

We have implemented a software prototype for the similarity assessment. It 
used WordNet (Miller 1995) and the Spatial Data Transfer Standard (SDTS) 
(USGS 1998) to derive a knowledge base. From SDTS we extracted the entity 
classes to be defined, their partial definition of is-a relations, and the attributes 
for entity types. By using WordNet we complemented the is-a relations with 
the part-whole relations and we obtained the structural elements (parts) of
entity types. Finally, functions were derived from verbs explicitly used in the 
description of entity classes, augmented by common sense. 

To illustrate the use of our model for interoperability, consider an urban- 
planning application that deals with the rehabilitation of the downtown of a 
city. To accomplish the goal, planners have decided to analyze and compare the 
downtowns of cities of similar sizes that are considered high quality examples 
of urban life. In the first instance, planners are concerned about the functio- 
nal components of the downtown, i.e., entity classes, and they have left for a 
posteriori analysis the geometric distribution of these components. 

Maps of each downtown are obtained from different spatial databases and 
we face the problem of comparing the semantics of entity classes. For the time 
being, we assume that maps are based on a common ontology because they were 
created by using the same conceptualization. Although the assumption of a uni- 
que ontology simplifies the problem of interoperability, different classifications 
within the same ontology remain possible. For example, what was identified as a 
sidewalk on one map could be identified as a path in another one using different 
criteria. This type of problem resembles the abstract level incompatibility discus- 
sed by Kashyap and Sheth (1996) when describing the schematic heterogeneities 
in multidatabases. 

Our approach to accomplish the planners’ objective is to evaluate the se- 
mantic similarity by searching for the best match, entity-to-entity, between two 
downtown maps. A portion of the knowledge base used for this application, re- 
presenting an ontology with only is-a relations, is shown in Figure 1. Entities 
that represent cases of polysemy (i.e., belonging to different entity classes with 
the same name but different meanings) and entities that belong to classes with multiple 
superclasses (i.e., an entity class with multiple inheritance) are highlighted. 
Figure 2 shows the complete description of an entity class, i.e., its distinguishing 
features and its semantic relations. 
Fig. 1. Entity class hierarchy (is-a relations). 



Since the planners in our example are mostly concerned with the functional 
components of the downtowns, they may assign a higher weight to the function 
features: for example, 50% for function features, with the remaining 50% shared 
between part and attribute features. For this application, the direction 
of the evaluation is determined by the target downtown (the downtown to be 
redesigned) against which the ideal downtowns are compared. 

Figure 3 shows a similarity assessment between a stadium and all other possible 
entity classes in the knowledge base. The similarity evaluation (Equation 1) 
is performed in two steps. First, the set-union and set-difference operations 
are determined between the set of features of stadium and the set of features 
of each entity class in the knowledge base. For example, stadium has five parts, 
four functions, and six attributes. Taking arena as an example of one of the 
entity classes to compare against stadium, arena has eight parts, four functions, 
and ten attributes. The set of common features between stadium and arena 
includes four parts, four functions, and four attributes. Second, the weight α 
(Equation 2) is determined based on the is-a and part-whole relations 
between stadium and the rest of the entity classes. Considering the comparison 
between stadium and arena, the common superclass between these two entities 
is construction, leading to a weight α equal to 0.33 when the direction of the 
evaluation goes from stadium to arena. This value of α reflects the fact that, in 
our knowledge base, stadium is a more general concept than arena. 
Numerically, the similarity assessment between stadium and four entity classes 
results in values that are greater than or equal to 0.5: arena (0.78), athletic 
field (0.62), tennis court (0.6), and construction (0.5). 

Fig. 2. Distinguishing features and semantic relations of an entity class. 
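The feature counts quoted for stadium and arena already determine how many features are common and how many belong to only one class. The sketch below does that bookkeeping. The helper name is an assumption, and it works on set sizes only because the individual features are not listed in this section.

```python
def feature_counts(n_a, n_b, n_common):
    """Return (common, only_in_a, only_in_b) sizes for two feature sets."""
    if n_common > min(n_a, n_b):
        raise ValueError("common features cannot exceed either set size")
    return n_common, n_a - n_common, n_b - n_common


# Totals and common counts as stated in the text (a = stadium, b = arena).
stadium_vs_arena = {
    "parts": feature_counts(5, 8, 4),
    "functions": feature_counts(4, 4, 4),
    "attributes": feature_counts(6, 10, 4),
}
```

The two classes share all of their functions, while arena has more parts and attributes of its own than stadium does.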

The similarity evaluation between athletic field and all other entity classes 
illustrates the asymmetric evaluation of the similarity model. For a symmetric 
evaluation, the similarity between athletic field and stadium should be the same 
as the similarity between stadium and athletic field (0.62). The similarity bet- 
ween athletic field and stadium, however, is 0.58. 
Fig. 3. Similarity assessment: stadium against all entity classes. 



6 Conclusions and Future Work 

Our model of semantic similarity has a strong basis in linguistics. It introduces 
synonyms and different senses in the use of terms. It also provides a first ap- 
proach to handle part-whole relations in the evaluation of semantic similarity. 
Furthermore, it defines a semantic-similarity function that is asymmetric for 
classes that belong to different levels of generalization in the semantic network. 
Although the model is affected by the definition of parts, functions, and attri- 
butes, it reduces the effect of the underlying semantic network when compared 
with many of the semantic distance models. 

As defined by our model, the asymmetric weights for the non-common features 
of each entity class (α and 1 − α) add up to 1. This means that, in total, 
common and different features have the same weight (i.e., 1). A further refinement 
can be made to the definition of the weights α and 1 − α if we consider that, in 
the assessment of similarity, people may tend to give more importance to the 
common features (Tversky 1977, Krumhansl 1978). 

Global semantic similarity assessment for spatial scenes could also be impro- 
ved. Our approach can be used to evaluate entity-to-entity similarity to obtain 
a global optimization of the similarity between two scenes. Problems arise when 
scenes have different numbers of spatial entities. A study of how much non- 
common entities affect global similarity assessment will help to obtain a better 
estimation of the semantic similarity between spatial scenes. 

Context has already been suggested to be a relevant issue for semantic simi- 
larity (Tversky 1977, Krumhansl 1978) and for interoperability (Kashyap and 
Sheth 1996, Bishr 1997). We expect to incorporate context, initially through 
matching a user’s intended operations with operations associated with the com- 
pared classes, in order to recognize different senses (semantics) of entity classes 
as well as to be able to define weights that reflect characteristics of a specific 
application. 

Human-subject testing will help determine how closely our model resembles 
people’s similarity judgments. It might also provide new insights about 
how important common and non-common features are for people. 

Finally, a big challenge for our model is to evaluate similarity across multiple 
knowledge databases or ontologies. When we assumed ontologies for specific 
domains as customized by users, we found significant differences in the definition 
of concepts within a single domain. In order to move forward towards the solution 
of interoperating systems we shall need to account for these differences. We shall 
have to relax our assumption of a unique ontology, replacing it by a common 
ontology that integrates multiple and independent domain specific ontologies. 
Abstract. Semantic interoperability in GIS is the ability to share geospatial 
information at the application level. This paper argues that we 
currently have a relatively clear understanding of what semantic heterogeneity 
means, which real-world cases exemplify this heterogeneity, 
and how to characterize these differences. We believe that the 
stage is now set for implementing a prototype to resolve semantic heterogeneity. 
We present here an ongoing case study and prototype development. 
This prototype is not intended as a definitive solution. Rather, 
it provides feedback in our search for scientifically sound 
and technically and commercially viable solutions to semantic interoperability. 



1 Introduction 

The more the geospatial information communities recognize that the real world 
is not separated into specific parts that exist independently from each other, 
the more they need to share information. Advanced technologies for data 
capture, such as satellites, scanners, automatic digitizing, and pen-computer-based 
field data recording, have revolutionized data capture techniques and led to 
an increase in the availability of digital data. Along with this revolution, new 
problems arose. For example, information communities find it difficult to locate 
and retrieve data from other sources in a reliable and acceptable form for their 
specific tasks. The reuse of geodata for new applications 
is very often a lengthy process. This is due to poor documentation, obscure 
semantics, diversity of data sets, and the heterogeneity of existing systems in 
terms of data modeling concepts, data encoding techniques, storage structures, 
access functionality, etc. [8]. 

2 Worldwide Awareness 

Because different geospatial information communities have an increasing 
need to share spatial information and possibly GIS services, changes 
to the existing GIS infrastructure are necessary and currently taking place. The 
networking between official authorities, industry, and academia as well as the 
provision of geodata to the public are currently changing. This affects 
configuration management, systems engineering and integration, and the 
methodologies and strategies that are applied by those disciplines. Effective 
management of cross-community technology development and integration requires 
new approaches. The OpenGIS Consortium (OGC) introduced some of them, 
mostly in close cooperation with the GIS industry and universities. Within the 
framework of the goal of interoperable GIS tools, the simple feature specification 
is a first step towards a common agreement between GIS industry and users 
to overcome non-interoperability in the geometry sector. The idea of semantic 
translators is an approach aimed at the problem-free exchange of 
thematic information. Such approaches are currently part of the discussion of a 
new orientation and goal setting of the whole GIS infrastructure [16]. Semantic 
interoperability is an issue that will play an increasingly important role within 
this framework. 

The OGC has identified the need for open geodata sharing and the exchange 
of open GIS services. “Openness” requires more than open interfaces and 
techniques to exchange data between different GIS [14]. The concept of openness 
must also include semantic interoperability, which is more than pure data 
transfer. The problem is that open information sharing is hard to achieve 
because different information communities assign different meanings and 
interpretations to data. Information sharing between different geospatial information 
communities is mostly impeded by any of three factors that are summarized in 
Table 1 (see [17]). [5] present some case studies on semantic non-interoperability. 
Such examples document that pure data transfer makes no sense if the members 
interpret the data differently. Users often realize, after sharing the data, that 
they cannot use them due to the specific view of the person who recorded them; 
in other words, data transfer was successful, but information was not shared 
[13]. 



Table 1. Impediments to information sharing 



• Ignorance of the existence of information outside one’s 
geospatial information community 

• Modeling of phenomena not of mutual interest 

• Modeling of phenomena in two representations so foreign 
to each other that each is not recognized by the other 



Geodata have geometric and thematic aspects. Apart from geometric 
aspects, for most applications the thematic aspects of terrain description and 
analysis are of prime importance [2]. To share data and to overcome the 
impediments mentioned above, functionalities are required that deal not only with the 
geometric and thematic aspects, but also with the semantic characteristics 
of spatial data. A special working group at the Institute for Geoinformatics, IfGI, 
is currently designing a semantic mapper prototype based on a 
case study from the area of transportation. 

Section 3 outlines some issues that should help to explain the urgent need for 
technical approaches to overcome semantic heterogeneity. Section 4 describes a 
case study of existing databases from the area of transportation. This is meant 
to be the real world application that will be used to implement the semantic 
mapper. Section 5 explains the details of a conceptualized system architecture. 
We will also show how a mapping between the different databases described 
in the case study could be realized. The work described here is the necessary 
basis for the implementation of a semantic mapper, or, in other terms, for the 
“technical” approach to overcome semantic heterogeneity. 

3 Approaching Technical Solutions 

The cause of semantic non-interoperability is semantic heterogeneity. Classifications 
of the types of heterogeneity differ. Table 2 summarizes some of the existing 
classifications. The classification by [4] highlights the difference between objects 
at the conceptual level, i.e., semantics, and their computer representation, i.e., 
syntax and schema. It makes it possible to focus on these three issues as 
“independent parts”. This means that resolving syntactic and schematic heterogeneity 
is not a difficult problem if the semantic heterogeneity is resolved. We therefore 
adopt this classification in this research. 

Semantics plays a crucial role when sharing information at the application 
level. The OpenGIS Consortium emphasizes the “Model of Geographic Information 
Communities” that are currently not or only partly able to share information 
[15]. The exchange and transfer of spatially referenced data is possible 
“once communities have agreed on translations between their different feature 
definitions” [14]. In many cases, this process has not yet been started. Organizations 
will still need to negotiate common understandings about the semantics 
of shared geographic information, to get the most information out of the shared 
data. Such negotiations need the support of the geospatial communities, including 
GIS users in specific geospatial information communities (GICs), academia, 
and even the GIS industry. The latter should specifically pay attention to the 
special requirements of users concerning the semantic issues. In addition to the 
internal level of data structures and record formats, the main challenge in resolving 
semantic heterogeneity is to characterize the properties of geographic 
data needed to ensure that translation and interoperability can be achieved at 
the semantic level [10]. 

Sheth argued already in 1991 that a technique is required to support semantic 
reconciliation, “...a process or technique to resolve semantic heterogeneity 
and identify semantic discrepancy, and semantic relativism that supports multiple 
views or interpretations of the same stored data” [19]. Contributions to 
Interop 1997 in Santa Barbara, California gave a comprehensive overview of the 
increasing efforts to identify and solve the problems and difficulties that occur 
in the area of semantic non-interoperability. Different paths to overcome semantic 
non-interoperability have been proposed. But the conference also showed 
that implementable (and therefore applicable) techniques are still lacking. A 
“semantic mapper” is meant to be such a tool, with a high potential to solve 
specific semantic non-interoperability problems. The main motivation for developing 
the prototype is that no concrete implementation of a semantic mapper is known 
to us. In the following we introduce our approach to resolving semantic 
non-interoperability. We first introduce a case study in which we show how to 
characterize the semantic differences between two information communities. We 
then introduce an ongoing effort in our group to develop a semantic mapper 
between GDF and ATKIS. 

Table 2. Types of heterogeneity 

Type of heterogeneity | Description | Literature 
Generic | Occurs when different nodes are using different generic models of the spatial information | [22] 
Conceptual | Occurs when the semantics of schemas depend upon the local conditions at particular nodes | 
Discrepancies in data definition | Equal features are described differently | [11] 
Differences in data structures | The same information is represented with different data structures in two databases | 
Semantic | Occurs mainly due to differences in the context world view of different data users | [3,4] 
Syntactic | Occurs as a result of different representations of real world features as fields or as objects in the databases of different users. It is directly related to semantic reference. | 
Schematic | Occurs due to different schemata, e.g., the classes, attributes and their relationships, which vary within and across contexts | 
Naming | Semantically alike entities in the cognitive content world refer to the same real world fact but have different names | [3] 
Cognitive | No common base of definitions of real world facts between two disciplines | 
Fig. 1. A two-direction road in the real world (a) is represented as one road element 
in ATKIS (b) and as two road elements in GDF (c). Panel (a) is labeled “Situation in Reality”. 

4 ATKIS and GDF: A Case Study 

Europe has a vast and extensive ground and water transportation network. Se- 
veral public and private agencies deal with transportation information, e.g., sup- 
pliers of data for car navigation systems, logistics transportation, and traffic con- 
trol, management, and analysis. These agencies usually require transportation 
information that stretches beyond national borders. For example, traffic mana- 
gement and control agencies often require transportation information collected 
by mapping agencies. 

There are several efforts to standardize transportation definitions and 
classifications, e.g., ATKIS and GDF. Developed between 1989 and 1995, the 
Authoritative Topographic-Cartographic Information System (Amtliches Topographisch- 
Kartographisches Informationssystem, ATKIS) of the Federal Republic of Germany 
is a topographic and cartographic model of reality. It models landscapes 
within digital landscape maps and the map content within digital maps [1]. 

The Geographic Data Files (GDF) is a European standard that was released in October 
1988 and went through several stages of update until 1995 [7]. It aims at 
providing a reference data model to describe road networks. GDF was created 
to improve efficiency in the capture and handling of data for the geographic 
information industry, by providing a model upon which applications can be 
built, e.g., car navigation, vehicle routing, and traffic analysis. 
The objectives of ATKIS and GDF differ because of differences in their social 
backgrounds. These differences have created groups called geographic information 
communities (GICs). We consider here a German topographic GIC and a 
pan-European traffic management GIC that use the ATKIS and GDF standards, 
respectively. For convenience, we call them the ATKIS GIC and the GDF GIC. 

The ATKIS GIC conceptualizes transportation networks as artifacts that 
are part of landscapes and ought to be presented in their topographic maps. 
The GDF GIC conceptualizes transportation networks as a section of the earth 
that is designed for, or the result of, any vehicular movement. 

From the GDF GIC point of view, the main purpose of a connection between 
their information system and the ATKIS information system is to provide the 
most recent information about new roads and their status, e.g., to 
provide an online service for car navigation systems. From the ATKIS point of 
view, the main purpose of a connection between their information system and 
the GDF information system is to take advantage of GDF’s traffic flow 
and routing information, and to provide it to the local applications 
that adopt ATKIS as their base model and require more information about 
traffic flow, direction, rules, etc. 

The problem starts by asking the question “does transportation network 
mean the same thing in the two GICs?”. 

In GDF, the term road encompasses roads, railways, waterways, junctions, rail 
junctions, and water junctions, while in ATKIS waterways are not considered 
part of a road. 

Roads in the ATKIS GIC refer to ground transportation networks. A road 
element is the smallest part of a road that has a consistent width, i.e., a width 
that varies only within a certain threshold. In GDF, a road network also encompasses 
ferry connections, which are not implied in ATKIS. In GDF, a road element 
depends not only on its width but also on traffic rules. For example, a new road 
element is created in GDF if the direction of flow changes. 

Even the term “ferry network” in ATKIS refers to ferryboats, while in GDF 
a ferry is a vehicle transport facility between two fixed locations on the “road 
network” that uses a prescribed mode of transport, for example, ship or 
train. This definition shows that in GDF the term road network includes waterways. 

Considering the ground transportation road network, we find that ATKIS 
includes pedestrian zones and bike roads as part of a road feature, while in GDF a 
pedestrian zone is not part of roads and a bike road is a type of road network. 

Figure 1 further illustrates the difference between the two information 
communities. Baker Street is a two-direction street. In ATKIS it is viewed as one 
road element that has two intersection points. In GDF the same road is represented 
as two road elements, one for each direction of traffic flow. If you ask the ATKIS 
GIC about Baker Street, you will get one road, as shown in Figure 1b. If you 
ask the GDF GIC the same question, you will get two roads for Baker Street, 
one for each traffic flow direction, as shown in Figure 1c. 
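The Baker Street example can be made concrete with a small sketch. The record type, the node names P1 and P2, and the merging rule are illustrative assumptions and are not part of the ATKIS or GDF specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoadElement:
    # Hypothetical record type; field names are illustrative only.
    street: str
    start: str      # node where the element begins
    end: str        # node where the element ends
    one_way: bool   # True if the element carries one direction of flow


def atkis_view(street, start, end):
    """ATKIS-style view: one element for a two-direction street."""
    return [RoadElement(street, start, end, one_way=False)]


def gdf_view(street, start, end):
    """GDF-style view: one element per direction of traffic flow."""
    return [
        RoadElement(street, start, end, one_way=True),
        RoadElement(street, end, start, one_way=True),
    ]


def gdf_to_atkis(elements):
    """Collapse opposing GDF-style elements into ATKIS-style elements.

    Assumes every element connects two distinct nodes.
    """
    seen = {}
    for element in elements:
        key = (element.street, frozenset((element.start, element.end)))
        seen.setdefault(key, element)
    return [
        RoadElement(street, *sorted(nodes), one_way=False)
        for street, nodes in seen
    ]


baker_gdf = gdf_view("Baker Street", "P1", "P2")
baker_atkis = gdf_to_atkis(baker_gdf)
```

Merging the two directed elements is only one direction of the mapping. Going from the ATKIS view to the GDF view needs traffic-flow information that the ATKIS element does not carry.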
Fig. 2. Intersections and junctions are two different concepts in GDF but not in ATKIS 
(panel: situation in GDF Level 2; legend: J = junction, RE = road element) 

Another example of the differences between the ATKIS and GDF views concerns the 
terms “junction” and “intersection”. While in ATKIS there is no apparent difference 
between the two terms, in GDF a junction refers to an end point of a road 
element and an intersection to an end point of a road, as shown in Figure 2. 
“Even roads do not always mean roads.” 

5 An Overview of the System Architecture 

In this section we introduce a system architecture to provide semantic 
interoperability between heterogeneous systems. The system is currently under 
development. Figure 3 depicts the two main components of the system, a semantic 
mapper and a wrapper, and shows where the semantic mapper is situated. 
The data source and the client each install both the 
semantic mapper and a wrapper. The semantic mappers at the source and the 
client are essentially the same. A semantic mapper is developed for a particular 
application domain, e.g., for transportation, topography, or telecommunication. 
The wrappers, however, differ depending on the underlying database. Finally, 
the system is designed such that the semantic mapper can communicate 
with more than one wrapper if the source has more than one database for the 
application domain, as shown in Figure 3. For example, an information provider 
can have a database for highways and another for streets of a certain city. 





Fig. 3. Architecture of the prototype 



To publish data for exchange, a source needs to define what data elements 
are available and model them as objects using the library of abstract interfaces 
provided by the semantic mapper. The semantic mapper and the wrapper are 
collectively called middleware. Middleware gives new applications a unified view 
of, and common interfaces to, heterogeneous legacy systems while maintaining 
their autonomy [18, 21]. Middleware typically relies on wrappers that act 
as an interface between the underlying legacy database and the client. 



5.1 The Semantic Mapper 

Applications that adopt the semantic mapper see heterogeneous legacy data 
stored in a variety of data sources as object instances that have well-known 
interfaces. These object instances are provided by the wrapper, as will be shown 
in Section 5.3. The mapper has two main components: a library of well-known 
interfaces and a domain ontology. The semantic mapper provides five 
main services. 

1. It acts as a communication manager between the information provider and the client. 

2. It provides a library of abstract interfaces. These interfaces are implemented 
by the wrapper and allow sending objects to and retrieving objects from its 
underlying database. 

3. It resolves differences between source and client. Wrappers model the underlying 
database as semantic mapper objects; when these objects are received at the client, 
the mapper identifies and resolves the semantic and schematic differences between 
the client database and the newly received objects. 

4. It provides a mechanism that guarantees that each outgoing object from the 
underlying data source has a universally unique object ID. 

5. It cooperates with the wrapper in query planning and execution, where it 
identifies the parts of the query that belong to each wrapper if the query ranges 
over several wrappers (within the space of one data source). 
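Services 2 and 5 can be sketched as an abstract interface that each wrapper implements, plus a mapper that routes the parts of a query to the wrappers that can answer them. All class and method names below are hypothetical, since the prototype's actual interfaces are not given in the paper, and the in-memory tables stand in for real databases.

```python
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical abstract interface a wrapper would implement."""

    @abstractmethod
    def handles(self, type_name):
        """Return True if the underlying database holds this object type."""

    @abstractmethod
    def fetch(self, type_name):
        """Return the objects of the given type from the database."""


class InMemoryWrapper(Wrapper):
    """Stand-in wrapper over a dict instead of a real database."""

    def __init__(self, tables):
        self._tables = tables

    def handles(self, type_name):
        return type_name in self._tables

    def fetch(self, type_name):
        return list(self._tables[type_name])


class SemanticMapper:
    """Routes each part of a query to the wrappers that can answer it."""

    def __init__(self, wrappers):
        self._wrappers = list(wrappers)

    def query(self, type_names):
        results = []
        for name in type_names:
            for wrapper in self._wrappers:
                if wrapper.handles(name):
                    results.extend(wrapper.fetch(name))
        return results


# One provider with two databases, as in the highways/streets example.
highways = InMemoryWrapper({"highway": ["A6"]})
streets = InMemoryWrapper({"street": ["Baker Street"]})
mapper = SemanticMapper([highways, streets])
```

Keeping the mapper unaware of how a wrapper stores its data is what would let the same mapper be installed at both the source and the client.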



(a) Class in the database of the information provider: 

    Carriage way 
        Name 
        Short name 
        Number of lanes 
        Surface material 

(b) Ordered pairs (schema element, ontology term): 

    (Carriage way, road) 
    (Name, Name) 
    (Short name, abbreviated name) 
    (Number of lanes, paths count) 
    (Surface material, pavement) 

Fig. 4. Ontology is assigned to each outgoing object from the wrapper 



5.2 Domain Ontology 

In the knowledge representation literature, ontology refers to a formal 
conceptualization of specific domains. The emerging field of ontological engineering 
has led to several attempts to create libraries of sharable ontologies. For example, the 
KOSMOS project has produced a domain ontology for machine translation from 
Spanish to English in the field of economy and corporate mergers. Information 
exchange between sources and clients requires a shared ontology [4]. An 
ontology of geographic kinds, of the categories or entity types in the domain of 
geographic objects, is designed to yield a better understanding of the structure 
of the geographic world [20]. The ontology publisher develops and maintains 
domain ontology libraries. The development of ontologies and related issues are 
outside the scope of this paper and are discussed elsewhere [6, 9, 12]. 

It is important to note that our system architecture assumes that the clients 
and the sources that are willing to share and exchange their information share 
and commit to the same ontology. The domain ontology library, 
as an embedded component in the semantic mapper, helps resolve the semantic 
conflicts between the data source and the client. Each object received by the 
semantic mapper from the wrapper has an attribute that takes its value from 
the shared ontology. 

Figure 4a shows a class carriage way as represented in the database of the 
information provider. The class has four attributes. We collectively call classes 
and their attributes schema elements. Each schema element is associated with 
an ontological term taken from the shared ontology, as an ordered pair, as 
shown in Figure 4b. 
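One way to read the ordered pairs of Figure 4b is as a lookup from schema elements to shared ontology terms, which a client can invert to find its own names for the same concepts. The sketch below assumes this reading. The source pairs are those of Figure 4b, while the client schema and the function name are hypothetical.

```python
# (schema element, ontology term) pairs of the provider, from Figure 4b.
SOURCE_PAIRS = [
    ("Carriage way", "road"),
    ("Name", "Name"),
    ("Short name", "abbreviated name"),
    ("Number of lanes", "paths count"),
    ("Surface material", "pavement"),
]

# Hypothetical client schema committed to the same shared ontology.
CLIENT_PAIRS = [
    ("Street", "road"),
    ("Label", "Name"),
    ("Code", "abbreviated name"),
    ("Lanes", "paths count"),
    ("Cover", "pavement"),
]


def schema_mapping(source_pairs, client_pairs):
    """Map source schema elements to client elements via shared terms."""
    # Index the client's elements by ontology term (case-insensitive).
    client_by_term = {term.lower(): element for element, term in client_pairs}
    return {
        element: client_by_term[term.lower()]
        for element, term in source_pairs
        if term.lower() in client_by_term
    }


mapping = schema_mapping(SOURCE_PAIRS, CLIENT_PAIRS)
```

Source elements whose ontology term has no counterpart at the client are simply left out of the mapping, and a real mapper would need a policy for them.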



5.3 Constructing the Wrapper 

A wrapper is an interface between the underlying database and the semantic 
mapper. Its main functionality is to provide a protocol for the communication 
between them. Wrappers provide three main services: 

1. Model the instances of the underlying database as objects understood by the 
semantic mapper. This is achieved by a reference model, provided by the semantic 
mapper, that defines abstract well-known interfaces, which are implemented 
by the wrapper. 

2. Participate in the query planning and execution. 

3. Implement, together with the semantic mapper, a comprehensive scheme 
for object IDs to uniquely identify objects in a heterogeneous distributed 
environment. 



(a) OID = XXX 
    Attributes: name, short name, number of lanes, surface material 
    Schema type: object_instance 
    Type: carriage way 
    Ontology: road 

(b) OID = XXX 
    Name: Short Name 
    Value: A6 
    Domain: A1, A2, A6 
    Schema type: attribute 
    Type: string 
    Class: Carriage Way 
    Ontology: Abbreviated name 

(c) OID = XXX 
    Name: A6 
    Schema type: value 
    Type: string 
    Parent: Short Name 
    Ontology: Abbreviated name 

Fig. 5. Objects sent by the wrapper to the semantic mapper 
Figure 5 shows a simple example of three object instances sent by the wrapper 
to the semantic mapper. Each object is identified by a unique ID. As depicted in 
Figure 5a, in addition to the attribute that indicates the corresponding ontology 
term, the object Carriage_way has three attributes. 

Attribute_list holds the list of the attribute names that belong to that object. 

DB type indicates whether the object is originally an attribute or an object 
instance. Hence, its value can be attribute or object_instance. In our example the 
value is object_instance. 

Type indicates the type of the object. If the object under consideration is 
originally an instance of a class, its type is that of the class. 

In our system, attributes and their domains are also posted to the semantic 
mapper as object instances, as shown in Figure 5b and c. The attribute 
short_name is posted as an object instance with a unique ID. 
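The three objects of Figure 5 can be sketched as instances of one record type. The field names follow the figure loosely. The single parent field, which merges the figure's Class and Parent entries, and the use of random UUIDs for object IDs are assumptions, since the paper does not say how the IDs are generated.

```python
import uuid
from dataclasses import dataclass, field


def new_oid():
    # Assumption: a random UUID stands in for the universally unique ID.
    return str(uuid.uuid4())


@dataclass
class MapperObject:
    name: str
    schema_type: str          # "object_instance", "attribute", or "value"
    type_name: str
    ontology: str
    parent: str = ""          # owning class or attribute, if any
    attribute_list: tuple = ()
    oid: str = field(default_factory=new_oid)


def post_carriage_way(short_name_value):
    """Build the three objects of Figure 5 for one carriage way."""
    instance = MapperObject(
        "Carriage way", "object_instance", "carriage way", "road",
        attribute_list=("name", "short name",
                        "number of lanes", "surface material"),
    )
    attribute = MapperObject(
        "Short Name", "attribute", "string", "Abbreviated name",
        parent="Carriage Way",
    )
    value = MapperObject(
        short_name_value, "value", "string", "Abbreviated name",
        parent="Short Name",
    )
    return [instance, attribute, value]


objects = post_carriage_way("A6")
```

Posting attributes and values as objects in their own right means each of them carries its own ontology term, at the cost of more objects per database record.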

6 Conclusions 

Interoperability in general, and semantic interoperability in particular, will 
undoubtedly lead to drastic organizational changes in the GI community. In this 
paper we have shown that semantic heterogeneity has to be resolved before the 
syntactic and schematic ones. Semantic differences between ATKIS and GDF were 
presented. The semantic mapper and the wrapper architecture were presented as our 
approach to the problem. Both components use interface technology to communicate 
with each other. The basic idea of the architecture is that wrappers 
post objects of their underlying databases to the mapper as object instances 
with well-known interfaces. The object instances hold detailed information about 
their semantics and their ontological reference. 

Our future task will focus on defining the well known interfaces as well as 
the relevant information that characterizes object semantics between ATKIS 
and GDF. Our methodology is to keep our list of the prototype functionality 
short in order to speed up the feedback cycle. This research paper presents the 
technical efforts to achieve semantic interoperability. We are also probing the 
theoretical ground in search for theories to characterize semantic heterogeneity 
and similarities. 
Abstract. Beyond technical aspects, interoperability of Geographic Information 
Systems is widely recognized to include semantic aspects. It 
is not yet clearly understood, however, which problems in data sharing 
stem from semantic issues and how they affect people’s work. In order 
to gain a better understanding, this paper takes noise abatement related 
to sports grounds as a practical example of data sharing and examines 
in detail the semantic issues arising with the usage of two key terms. 
The first term we look at is ’’relevant sports ground”. The selection of 
sports grounds that are considered by noise abatement planning is based 
on it. The second term is ’’sports ground” as such. The understanding 
of it determines how sports grounds are modeled in various data sets 
that might serve as an information source for noise assessment. We show 
that the meaning of sports grounds lies in their usage, at least in the 
case of noise abatement, and that available data sets do not contain the 
necessary usage information. Furthermore, we show that the available 
geometric data only become useful when several sources are combined: 
interoperability is required to map between data and user semantics. 



1 Introduction 

A great deal of current research attempts to find means for describing the se- 
mantics of data in order to support data sharing [8, 5, 10]. This research is 
motivated by the belief that successful interoperability is not only a technical 
matter of accessing distributed components, but requires a documented under- 
standing of what the information held by these components means. The work 
reported here sprang from that same position, but evolved to demonstrate that 
the reverse claim is true as well: coping with the semantics of different applicati- 
ons requires interoperability. Only the kind of interoperability based on service 
interfaces can cope with the necessary semantic mappings between user requests 
and information sources. Data transfers cannot generally satisfy the information 
needs of applications cutting across information communities, even if the data 
were complete and with documented unambiguous meanings. 

The paper presents a detailed case study of an application taken from or- 
dinary planning practice in Germany. Imagine a local environmental authority 
that has to assess the noise in an urban area coming from sports grounds. Be- 
fore actually assessing the impact, the responsible person has to find out which 
sports grounds represent relevant sources and then build an emission model.
Both tasks require a lot of data which the environmental authority usually does 
not maintain, but which are spread over the sports authority, sports clubs, the 
surveying and cadastral authority and possibly other authorities having to do 
with sports grounds, depending on the local administrative organization. This 
indicates how many different people who have nothing to do with noise reduction 
issues must be asked for information, how often questions will not lead to the 
expected answers, how many data sources have to be examined, and how long it 
can take to merge all the data into the information product the responsible person
needs to fulfill the tasks - if it is possible to create such a product at all. 

From a semantic perspective, the planners are struggling with the meaning of
the term "sports ground" in the context of noise abatement. This meaning rests
on the perspective on sports grounds of a person living in their neighborhood,
not on that of a surveyor, cartographer or land use planner, nor on that of a
public transport planner or social worker. Since most available data on sports
grounds describe a perspective from surveying, mapping and possibly land use 
statistics, neither the noise expert nor the transportation planner nor the social 
worker are likely to find their understanding of sports grounds reflected in them. 
Consequently, the best means of describing data semantics may not be sufficient 
to answer their questions. 

In an ideal world, these application specialists would be able to query distri- 
buted information resources for answers to their questions rather than individual 
databases for their contents. It would be an agent’s task to determine whether 
and where there are data answering the questions or allowing for inferences to 
support one or the other hypothesis. In the case of sports grounds, such data 
may come from use statistics, traffic counts, regulations, surveys, remote sensing 
images, etc. The crucial challenge that motivates our research is how to bridge 
the gap between the data and the questions. We take it up by first studying in 
great detail some specific questions and data sources. The paper first describes 
the task setting, then discusses the semantics of relevant sports grounds and 
sports grounds in general, and concludes with observations on the multi-layered 
nature of such practical semantic problems. The somewhat surprising result is 
that these kinds of semantic issues are not just aggravated by interoperability, 
but require interoperability to be dealt with at all. 



2 The Task: Noise Abatement 

Sports activities make noise. These noise emissions are addressed in the German
"Lärmminderungsplanung" (noise abatement planning), an administrative procedure
dealing with noise impact (German: Immission) coming from different sources,
e. g. traffic, industry, and sports. When assessing the noise emitted from
sports grounds, the understanding of terms plays an important role, because it
determines which sports grounds have to be considered and how they are modeled.

Noise abatement planning is designed as a small-scale instrument for urban
areas. Only those sports grounds that contribute significantly to the noise
situation are examined in detail. Consequently, as a first step the following
question must be answered: "Which sports grounds are relevant to noise abatement
planning?" The term to be examined in this context is relevant sports ground
(section 3).

When the selected relevant sports grounds are looked at in detail, it is
sensible to use existing (digital) information for building the emissions model
on which the calculation of noise impact is based, if the data meet acoustic
requirements. This leads to the question: "What data sets can be used in
modeling sports grounds?" In other words: "Do data sets exist which have the
same understanding of a sports ground as an acoustic expert?" The term to be
examined in this context is sports ground (section 4).

In the following two sections, the meanings of the two terms to different groups
of people are described. The difficulties arising from different understandings
are pointed out as well.



3 Relevant Sports Grounds 

The local administration is responsible for noise abatement planning; thus in- 
structions contained in the law must be taken into consideration first. Further- 
more, it is useful to have a look at how such a task is executed, because it involves 
people who did not create the laws and because it usually takes refinement and 
interpretation to put legal instructions into practice. Table 1 shows how laws 
and other instructions in the federal state of North Rhine-Westphalia deal with
the relevance of sports grounds in noise abatement planning. 

Table 1 reveals that there is only one explicit definition of sports grounds 
available (given by [11], see row 3). The community of noise assessment seems to 
have a common basic understanding of the term sports ground. Differences show
up, however, in the criteria for the relevance of sports grounds. The closer one
comes to practice, the more criteria are found and the more precise they become. In addition,
the kind of criteria changes: the law refers to size (certainly taking it as a proxy 
for aspects of usage), and practice refers mainly to usage. In fact practice does 
not even mention size, but several more precise aspects of usage. A large number 
of criteria with high precision might be useful for understanding, but can also 
prevent information access: people might be overwhelmed by the amount and 
detail of questions and put them aside until they have enough time to answer 
them. Or information is gathered on a less detailed level and the questions cannot 
be answered at all. For a sports ground without scheduled use, e. g., nobody keeps 
a record of the hours of usage and the number of people present. In this case 
it becomes necessary to define fewer criteria and dispense with accuracy. Still, 
basically the same attributes are addressed. One considers, e. g., probability of 
usage instead of exact training and competition hours (see table 1, rows 5 and
6).

Table 1. Relevance of sports grounds in North Rhine-Westphalia according to various
sources

| Regulation and information source | Aim; contents; definition of sports grounds | Criteria for relevance of sports grounds |
|---|---|---|
| 1. Federal law for protection against noise impact ([4, §47a]) | introduces noise abatement planning; does not mention special noise sources like sports grounds; no definition | no criteria mentioned |
| 2. Instructions for administration concerning the federal law; model [9] and version for North Rhine-Westphalia [12] | make instructions by law more concrete and practicable; mention sports grounds as noise sources; no definition | larger sports grounds |
| 3. Instruction for protection against the noise of sports grounds [11] | defines a uniform method for the assessment of noise impact caused by sports grounds; assessment procedure; stationary facilities intended for doing sports, including facilities that have close proximity, spatially and operationally | no criteria mentioned |
| 4. North Rhine-Westphalian guide for making noise impact plans [6] | refines legal instructions to put them into practice; no definition | competitions; preparation of competitions; run by municipalities, clubs, enterprises; considerable noise emission; examples: football fields with more than 200 spectators, tennis complex with more than 3 courts |
| 5. Practice, version for sports grounds with scheduled use | carries out noise impact assessment | outdoors; ball games like soccer or tennis; regular usage; used by clubs; competitions on Sundays between 1 p.m. and 3 p.m.; many spectators; training after 8 p.m.; usage of loud-speakers; residential buildings in the neighborhood (within a radius of 200 m in case of a usage before 8 p.m. or after 10 p.m., otherwise within a radius of 100 m) |
| 6. Practice, version for sports grounds without scheduled use | carries out noise impact assessment | outdoors; noisy forms of sports; acceptance by population; high probability of usage after 8 p.m.; much usage between 8 a.m. and 8 p.m. (counted in 25, 50 or 75 %) |

What Are Sports Grounds? 221

Obviously, different answers to the question "Which sports grounds are relevant
to noise abatement planning?" will produce different results in impact
assessment. This creates a situation which the administration must avoid as much
as possible. Furthermore, there is the issue of communication. If you have to
ask other authorities for data, which happens regularly with noise abatement
planning, you must be able to give a precise description of what you want. This
requires being clear in one’s own mind about the requirements and being able to
express them. If you ask for a list of all larger sports grounds and thus take
the law literally, you might be presented with many grounds that are not used
regularly and are consequently of no interest. On the other hand, important
smaller grounds that are used regularly will not appear. Aside from this,
appropriate questions must be asked that take into account the availability of
information. Otherwise, you may not receive any information at all or only
ill-fitting data.
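
The difference between asking by size and asking by usage can be made concrete with a small sketch. It encodes only a simplified subset of the practice criteria of Table 1, row 5. The record fields, the 5000 m² size threshold and the three sample grounds are assumptions for illustration, not values from the regulations.

```python
from dataclasses import dataclass


@dataclass
class SportsGround:
    name: str
    area_m2: float               # size, the proxy used by the legal instructions
    outdoors: bool
    regular_usage: bool
    sensitive_hours_usage: bool  # usage in the hours for which Table 1 gives 200 m
    nearest_residence_m: float   # distance to the nearest residential building


def relevant_by_size(ground, min_area_m2=5000.0):
    """Literal reading of 'larger sports grounds' (threshold is an assumption)."""
    return ground.area_m2 >= min_area_m2


def relevant_by_usage(ground):
    """Simplified subset of the practice criteria in Table 1, row 5."""
    radius_m = 200.0 if ground.sensitive_hours_usage else 100.0
    return (ground.outdoors
            and ground.regular_usage
            and ground.nearest_residence_m <= radius_m)


# Invented sample data.
grounds = [
    SportsGround("large, rarely used field", 9000.0, True, False, False, 80.0),
    SportsGround("small club court", 700.0, True, True, True, 150.0),
    SportsGround("indoor gym", 1200.0, False, True, True, 30.0),
]

by_size = [g.name for g in grounds if relevant_by_size(g)]
by_usage = [g.name for g in grounds if relevant_by_usage(g)]
```

With these sample records the size-based request returns only the rarely used large field, while the usage-based request returns only the small club court, which is the mismatch described above.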

4 Sports Grounds

The "Sportzentrum Roxel", a sports complex in Münster, was chosen as a case
study to compare different views on the same real world object. It is a larger
sports ground comprising several playing-fields of various sizes dedicated to
different forms of sport, a stand for spectators and parking lots which, according
to a legal definition [11], belong to the sports ground. The following descriptions
only take into account geometry, because the other attributes are too specific to
appear in existing data sets whereas geometry can be directly compared.



4.1 Aerial Photograph - The Real World Situation 

The aerial photograph in fig. 1 is meant to give an "objective" view of the
sports ground, as far as this is possible. There is a large pitch in the center of
the complex, the main field, where mostly soccer is played. It is surrounded by
track and field areas, and a stand for spectators adjoins it to the south. In the
northeast another large field is located, which is also mainly used for soccer. 
Between these fields you can see two smaller multipurpose fields (for basketball, 
handball, and volleyball), which - in contrast to the large fields - are free to 
be used by everyone and are not subject to any schedule. North of them, there 
is another track and field area. Ten tennis-courts are situated in the southwest 
corner and east of them, parking lots can be seen. A beach volleyball field lies 
next to the parking lots, and east of it there is an indoor swimming pool. The 
building at the eastern edge is a gym. 



4.2 Acoustic Model 

Fig. 2 shows how an expert in acoustics models the real world situation. Only 
those parts are to be found that are important from an acoustic point of view, 
the sources of noise. Noise can be emitted by players, spectators and cars. But 
some of the areas, where these sources originate, do not appear in the model due 
to a low frequency of usage, the form of sports, or because they represent indoor 
facilities. They are irrelevant for noise abatement planning. The two large fields, 
the tennis-courts, the stand, and the parking lots are relevant sources that must 
be looked at. 

4.3 Cadastral Data (See [7]) 

In fig. 3, cadastral data ("Automatisierte Liegenschaftskarte", ALK) of the city
of Münster for the same area are shown. Cadastral data are meant to provide
information about location, shape and size of parcels as a basis for property 
documentation and taxation. From this perspective, sports grounds are not an 
object of primary interest. They belong to the supplementary topography, which 
is not registered systematically. This is why only a few parts of the sports
complex appear. There are three objects, each of which represents a
generalization of two tennis-courts. In contrast to that, the parking lots are
modeled in detail. The catalogue of ALK-objects in addition contains small and
large fields, which is why we can expect corresponding extensions of the current
map at some time, but the catalogue does not provide an object for the stand.

Fig. 1. Aerial photograph of the "Sportzentrum Roxel"



4.4 Topographic Data 



The intention of topographic data is to show the surface of the earth and the 
objects on it. Depending on scale and purpose, the modeling of real world objects 
in topographic maps varies. Fig. 4 contains two examples for this kind of data. 
On the one hand, a part of the German base map 1:5000 ("Deutsche Grundkarte
1:5000", DGK 5, here not shown to scale) appears in black lines. This map is
available digitally only in raster format. On the other hand, digital vector data
of the Authoritative Topographic-Cartographic Information System ("ATKIS")
are shown in gray. The contents of ATKIS are comparable to a topographic map
with a scale of 1:25000.

Fig. 2. An acoustic model of the "Sportzentrum Roxel"



DGK 5 (see [1]). In DGK 5, all fields are represented, but there are gene- 
ralizations: the playing field in the oval area is not a separate object, and two 
adjacent tennis-courts are represented by one object. The parking lots are
generalized as well by depicting just their outlines. Since the outlines are not
closed, they merge with the street into one object. The DGK 5 does not have an
object "stand".



ATKIS (see [2]). In ATKIS, the whole sports complex is depicted by one large 
object. ATKIS also comprises objects like playing field, stand and parking lot, 
which could show more details, but they have not been included here yet. In 
the future, updates will create a more detailed model of the sports complex in 
ATKIS. 



4.5 Comparison of the Models 

The acoustic model requires the existence of certain objects with a certain level
of generalization. The large fields, e. g., must be represented by rectangular
objects. Thus, the generalized oval object in the DGK 5 is not acceptable. The
parking lots just require outlines; a subdivision into parking spaces and areas
with bushes and trees between them is too detailed for the purpose of noise
abatement planning. In this respect the generalization of the DGK 5 is better
than the detailed cadastral data. For the consideration of spectators, which is
only necessary for the large fields and not for tennis-courts, there are several
possibilities. The easiest way is to assume spectators on the field together with
the players and the referee. But this method is only applied for small numbers of
spectators. When the number of spectators increases, more exactness is needed
and consequently the geometry must change. Either the stand must be introduced
as an additional object, or the rectangle of the field must be enlarged by several
meters at the long sides. This will somewhat improve results near the field.

Fig. 3. The "Sportzentrum Roxel" in the ALK
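
The second option, enlarging the field rectangle at its long sides, can be sketched as follows. The axis-aligned Rect type, the field dimensions and the margin of 5 m are assumptions chosen for illustration; the text only speaks of "several meters".

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in meters."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def widen_at_long_sides(field, margin_m):
    """Return a rectangle enlarged by margin_m beyond each long side."""
    width = field.xmax - field.xmin
    height = field.ymax - field.ymin
    if width >= height:
        # The long sides run along x, so the rectangle grows in y.
        return Rect(field.xmin, field.ymin - margin_m,
                    field.xmax, field.ymax + margin_m)
    # The long sides run along y, so the rectangle grows in x.
    return Rect(field.xmin - margin_m, field.ymin,
                field.xmax + margin_m, field.ymax)


pitch = Rect(0.0, 0.0, 105.0, 68.0)       # assumed field dimensions
with_spectators = widen_at_long_sides(pitch, 5.0)
```

The short sides stay unchanged in this sketch, since spectators are assumed to stand along the long sides only.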

This means that all data sets show weaknesses and cannot be used alone. In 
the cadastral data, many objects are missing at this time, whereas parking lots 
are depicted in too much detail. The DGK 5 is not detailed enough regarding 
the oval field. ATKIS is lacking any useful object at this time, but it is the only
data set which could provide an object "stand" in the future.

Fig. 4. The "Sportzentrum Roxel" in DGK 5 and ATKIS



All data collections basically concur that there is an area dedicated to sports, 
but each set has its own perspective differing from the acoustic perspective. Con- 
sequently each set provides at best part of what is needed for noise impact asses- 
sment. To get an appropriate digital acoustic model, parts of (future) cadastral 
and future ATKIS data together can be taken as a basis and completed by
digitizing supplementary information from the DGK 5. Currently, existing data
are often digitized again, or data not exactly meeting requirements are used,
although more appropriate data exist. This occurs because users are not
presented with the data they need and because it is often too complicated to build
the necessary data out of (several) existing sets. Support for adapting data to
the requirements of users (e. g. dividing the square tennis areas of the cadastral 
data into two tennis-courts) and merging digital data collections could improve 
this situation. This would mean, on the one hand, improving the use of available 
data and, on the other hand, giving access to the most appropriate data and 
thus making possible the best results for a given task. 
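
As an illustration of such an adaptation, the following sketch divides one generalized cadastral tennis object into two courts. It is a sketch under assumptions: the objects are treated as axis-aligned rectangles, the 36 m by 36 m outline is invented, and the split axis is passed in by the caller, because a near-square outline alone does not determine it.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in meters."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def split_rect(area, axis):
    """Split a rectangle into two equal halves across the given axis."""
    if axis == "x":
        mid = (area.xmin + area.xmax) / 2
        return (Rect(area.xmin, area.ymin, mid, area.ymax),
                Rect(mid, area.ymin, area.xmax, area.ymax))
    if axis == "y":
        mid = (area.ymin + area.ymax) / 2
        return (Rect(area.xmin, area.ymin, area.xmax, mid),
                Rect(area.xmin, mid, area.xmax, area.ymax))
    raise ValueError("axis must be 'x' or 'y'")


double_court = Rect(0.0, 0.0, 36.0, 36.0)   # invented cadastral outline
courts = split_rect(double_court, "x")
```

In practice the split axis would have to come from another source, which is one more reason why the data sets cannot be used alone.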

5 Conclusions

Our case study on sports grounds has established a series of shortcomings in the 
current practice of using spatial data. It was originally intended to document 
specific semantic problems arising in an application picked more or less at
random. In this direction, we have at least been able to establish a case of
practical needs for semantic information and a detailed account of practical
semantic issues arising in a particular case. The paper thereby makes a
contribution to the first phase of an ongoing project that collects and analyzes
specific examples of semantic interoperability issues
(http://ifgi.uni-muenster.de/english/3_projects/sip/index.html). The challenges
we encountered, however, are far
more complex than the kind of semantic differences one might try to resolve by 
semantic translation [3]. They reveal several layers of technical, organizational 
and cognitive issues that cannot be resolved by putting more semantics into 
spatial data models or component interfaces. In conclusion, we attempt to draw 
up an incomplete list of these issues. Some of them are already well known and 
documented. Others suggest that information sharing is a much more complex 
problem than the GI community might have expected. Furthermore, they sepa- 
rate this complexity into several smaller parts. They demonstrate the need for 
interoperability and the inadequate role of data transfers in applications cut- 
ting across traditional boundaries of information communities. The case study 
established that: 

1. Phenomena of interest to "spatially aware professionals", such as sports
grounds in the context of noise abatement, are today most often captured in
some kind of spatial data.

2. These data are generally held outside the organization of the interested pro- 
fessionals and consequently difficult to access and assess for suitability. 

3. Most existing data reflect the particular semantics of the Surveying and 
Mapping Information Community. It is firmly grounded in planar geometry 
and has little to say about people, processes of use, or temporal and physical 
attributes. 

4. The questions that can be answered by certain data collections are often 
several steps of interpretation away from those raised by an application. 

5. All data necessarily exhibit certain levels of quality, along the usual axes 
of accuracy, resolution, completeness, consistency and currency. These dif- 
ferent degrees of quality become very obvious when attempting to integrate 
information from multiple sources. 

6. Communication (with people as well as with computer systems) plays an 
important role in data sharing, because data requests must be unambiguous 
in order to be successful. Planning tasks mostly involve different information 
communities, which makes the communication issue even more important. 

7. It often takes too much time to find, assess and use data. Especially when
the amount of data required is not too large, as in our example, data are
instead digitized again for economic reasons.







C. Riedemann and W. Kuhn 



While it still appears too difficult to order these issues with respect to their
amenability to practical solutions, the list shows that, in order to make
spatial information viable, one needs to transcend the limitations of thinking in
terms of data collections and data transfers. The needs for spatial information 
have to be addressed from the user’s perspectives and supported by technical 
as well as institutional means to translate these needs into requests for services 
that search and exploit data collections. On the road to making this user-driven 
scenario reality, the establishing of spatial data warehouses by producing some 
kind of canonical representations of the landscape, documenting them through 
metadata, and making them accessible by exchange formats is only a beginning. 
A second step, implemented by OpenGIS® architectures, is to create service in- 
terfaces that offer answers to questions and not just excerpts from databases 
in some cryptic exchange format. A third step, investigated in the research on
semantic interoperability, is to enrich these interfaces with means to capture
more semantics. The real breakthrough to an improved usability of spatial data,
however, will only come with a better understanding of the mechanisms needed
to map user questions to data through service interfaces. Our research agenda
targets the identification and formalization of such semantic mappings [5]. The
case study presented here is a first attempt at understanding their nature and
requirements.
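
What mapping a user question to data through service interfaces could look like is sketched below. This is an illustration under assumptions, not an OpenGIS interface: the two provider protocols, the service class and all sample contents are hypothetical.

```python
from typing import Protocol


class GeometrySource(Protocol):
    def outlines(self) -> dict: ...        # object name -> outline


class UsageSource(Protocol):
    def regularly_used(self) -> set: ...   # names of regularly used objects


class TopographicMap:
    """Stand-in for a geometry provider (contents are invented)."""

    def outlines(self):
        return {
            "main field": (0, 0, 105, 68),
            "tennis courts": (0, -60, 36, -24),
            "gym": (150, 0, 190, 30),
        }


class ClubSchedule:
    """Stand-in for a usage provider (contents are invented)."""

    def regularly_used(self):
        return {"main field", "tennis courts"}


class NoiseSourceService:
    """Answers a user question instead of handing out raw data excerpts."""

    def __init__(self, geometry: GeometrySource, usage: UsageSource):
        self._geometry = geometry
        self._usage = usage

    def relevant_sources(self):
        # The question "which objects are relevant noise sources?" is mapped
        # onto one source for usage and another one for geometry.
        used = self._usage.regularly_used()
        return {name: outline
                for name, outline in self._geometry.outlines().items()
                if name in used}


service = NoiseSourceService(TopographicMap(), ClubSchedule())
answer = service.relevant_sources()
```

The caller never sees which provider contributed which part of the answer; in this sketch that mapping is hard-coded, whereas the research agenda described above aims at identifying and formalizing it.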



Abstract. For data to be successfully integrated, semantically similar
database elements must be identified as candidates for merging. However, 
there may be significant differences between the concepts that partici- 
pants in the integration exercise hold for the same real world entity. A 
possible method for identifying semantically similar elements prior to 
integration is based on cognitive science theory of concept attainment. 
The theory identifies inclusion rules as being the basis for the highest 
level of concept attainment, once concepts have been attained at lower, 
perceptive levels. Predicates can be used to combine inclusion rules as a 
basis for semantic representation of elements. The predicates for different 
database elements can then be compared to determine the similarities 
and differences between the elements. This information can be used to 
develop a set of semantically similar elements, and then to resolve repre- 
sentational conflicts between the elements prior to integration. 



1 Introduction 

Since the popularization of spatial information systems, a number of researchers 
have discussed the benefits of spatial data sharing [31]. However, in recent years, 
data sharing has become increasingly important. This increase has been motiva- 
ted by growing environmental concerns, pressures on government and the private 
sector to perform more efficiently, recognition of the synergistic advantages of 
spatial data and the increasing availability of a wide range of data [33], [21]. 

Despite agreement that spatial data sharing is an important goal, attempts 
to achieve such a goal have often been frustrated by heterogeneity in the data. 
Data heterogeneity can be classified into schematic heterogeneity, syntactic
heterogeneity and semantic heterogeneity [3]. If any of these types of heterogeneity
exist, data from different sources cannot be readily integrated for use in problem 
solving. 

Schematic heterogeneity refers to differences in the type of database elements 
that are used to represent a particular real world entity (for example, attribute, 
relation or class) [3]. 

Syntactic heterogeneity refers to differences in the structures of the elements 
that are used to represent real world entities. For spatial data, structure relates 
to the type of geometric element (for example, point, line or polygon), and 
the characteristics of that element. For non-spatial data, structure relates to 
the representational details of the database element (for example, data type, 
constraints or domain) [3]. 

Semantic heterogeneity refers to differences in the definition of concepts and 
the rules that are used to determine whether a real world entity is an example of 
a concept. Semantic heterogeneity is the source of most data sharing problems 
[3], and is the focus of the research described in this paper. Semantically simi- 
lar database elements must be identified and any heterogeneity resolved before 
schematic and syntactic heterogeneity can be addressed. Methods for resolution 
of the latter two types of heterogeneity are provided by [9], [39], [30] and [10]. 

Semantic heterogeneity between individuals can be significant. The semantics 
that individuals have for a particular real world entity vary depending on the 
concepts or categories they use to classify the entities they encounter in ever- 
yday life. These concepts differ between individuals depending on education, 
experiences and theoretical assumptions [29]. For some applications (for exam- 
ple, hospital management systems), a certain level of similarity in world views 
can be assumed [46]. However, this assumption is not valid for spatial data, as 
users with a wide range of backgrounds are likely to be interested in spatial data 
due to its fundamental nature [33]. 

The OGIS information communities model groups people that have the same
semantics for a particular set of concepts together into information
communities¹. Data sharing within information communities is relatively easy
because the members of the community use similar concepts. However, if data is
to be shared between communities, semantic similarity cannot be assumed. In this
case, sharing requires that semantic similarities and differences between
elements can be identified and resolved [33].

¹ "An information community is a collection of people ... who, at least part of
the time, share a common digital geographic information language and share common
spatial feature definitions. This implies a common world view as well as common
abstractions, feature representations, and metadata." [33]

Methods for determining the semantic similarity of elements proposed to
date can be divided into two main groups. The methods in the first group are
concerned with the identification of semantic similarity based on
representational (syntactic) details. This includes element characteristics
[20], [7], or at the more complex end of the spectrum, behavior [12], [19].
Elements are considered semantically similar if they have similar representation
or behavior respectively. These methods are limited in application in
semantically heterogeneous environments because they assume that semantic
similarity is correlated with syntactic similarity.

The methods in the second group are more concerned with interrogation of 
the semantics of elements, and examine definitions [3] or terminological relati- 
onships [6], [13] in an attempt to determine semantic similarity. These methods 
are limited in their ability to identify similarities and differences because they 
usually assume that a common language is used (that is, a given term has the 
same meaning across all databases). This assumption is often invalid in the con- 
text of spatial data, because the same term may be used by different information 
communities to mean different things. Similarly, different terms may be used by 
different information communities to mean the same thing. 
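
Both failure modes of term-based matching can be shown in a few lines. The two vocabularies below are invented for illustration, and each maps a community's term to the concept that community means by it.

```python
# Hypothetical vocabularies of two information communities.
community_a = {
    "road": "paved public carriageway",
    "watercourse": "natural flowing water body",
}
community_b = {
    "road": "any trafficable route",
    "stream": "natural flowing water body",
}


def match_by_term(a, b):
    """Pairs elements whose names are identical."""
    return sorted(set(a) & set(b))


def match_by_concept(a, b):
    """Pairs elements whose underlying concepts are identical."""
    return sorted((term_a, term_b)
                  for term_a, concept_a in a.items()
                  for term_b, concept_b in b.items()
                  if concept_a == concept_b)


term_matches = match_by_term(community_a, community_b)
concept_matches = match_by_concept(community_a, community_b)
```

Matching by term pairs the two "road" elements although they denote different concepts, and it misses the pair "watercourse" and "stream", which denote the same one.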

This paper describes a new method for identifying semantic similarity of 
database elements, referred to as the inclusion rules method. The method is 
an extension of some of the methods that fall into the second group in the 
previous paragraph, and applies psychological theories of concept attainment to 
the problem of the representation of element semantics. The method attempts 
to increase the applicability of the previously proposed methods by reducing 
reliance on the assumption that participants use a common set of terms and 
definitions. This goal is particularly important for spatial database integration, 
because users of spatial data often have widely varying backgrounds and thus 
different semantics for real world entities. Consequently, this paper focuses on the 
use of the method for spatial data. However, many other database applications 
involve a similarly diverse range of individuals or groups, and the method is 
equally suitable for these applications. 

The proposed method allows users to represent the semantics of their own 
database elements with a predicate that combines several inclusion rules. Inclu- 
sion rules are rules that indicate the characteristics that an instance must have 
to be considered an example of that element. The predicates are then used to 
determine the similarities and differences between any two database elements. 
The information about the relationship of the elements provides a tool for re- 
solution of variations and ultimately, element integration, although this is not 
described in this paper due to space limitations (refer to [47] for details). 

The next section in this paper briefly reviews the methods that have been 
suggested by other researchers, both from conventional and spatial database 
fields of research, for identifying semantic similarity. Section 3 provides an in- 
troduction to the inclusion rules method, including a review of the psychological 
theory that provides its foundation, and defines a formal language for the in- 
clusion rules method. Section 4 contains an example of the method and its use 
in representing element semantics and determining the semantic equivalence of 
database elements. 

This paper uses the term real world entity to refer to some ‘thing’ experienced 
in the real world. Individuals group these real world entities into concepts in order 
to deal with the world. Concepts are defined according to a set of rules that 
dictate whether an entity is an example of the concept or not. If the experienced 
real world entity fulfills those rules, it is considered an example of the concept. 

In an object oriented database, concepts are implemented as either classes 
or attributes, and real world entities as either objects of the classes or values of 
the attributes respectively [11], [38]. Similarly, in a relational database, concepts 
are implemented as either relations or attributes and real world entities as tuples 
of the relations or values of the attributes respectively [11]. In the remainder of 
this paper, the term element is used to refer to any database representation of a 
concept (that is, class, relation or attribute), and the term instance is used to refer 
to any database representation of a real world entity (that is, object, tuple or value). 

2 Review of Integration Methods 

As described in the previous section, database integration methods must be 
capable of identifying both the differences and the similarities between elements. 
Where elements differ, it must be possible to identify how they differ. 

The methods proposed by database researchers for identifying semantic si- 
milarities and differences can be divided into two groups. The methods in the 
first group take advantage of the representation of databases in various ways, 
comparing representations to identify semantically equivalent elements. The me- 
thods in the second group attempt to identify semantically equivalent elements 
by interrogating definitions or relationships between terms. This section provides 
a very brief review of these methods. The interested reader should refer to the 
relevant references for more information. 



2.1 Representational Methods 

The representational database integration methods use the characteristics of 
each database element to determine semantic equivalence. A common approach 
is to determine attribute equivalence by comparing structural characteristics 
including uniqueness, domain, constraints, allowable operations and units [20], 
and then relation equivalence on the basis of the equivalence of either all the 
component attributes [7], or only the key attributes [20]. 

A more general form of this method uses a combination of context and domain 
to determine semantic equivalence. The context is some representation of the 
underlying assumptions of the element, and may take the form of the database, 
a set of named domains or a rule based formal expression [16], [41]. 

Another method combines usage and access patterns with other representational 
information to determine semantic equivalence [46]. 

The use of the representational details to determine semantic equivalence has 
been applied to the non-spatial components of spatial information systems [31], 
with the addition of role information. This latter information refers to the use 
or meaning of the element, and is stored in the data dictionary [32]. 

Moving to the more complex end of the representational spectrum, some 
researchers have suggested that element behavior be compared to determine 
semantic equivalence. A common approach is for behavioral equivalence to be 
determined by running all operations that are defined on an element on itself, 
and then comparing the results with the same operations run on the element to 
which it is being compared [12]. A more sophisticated variation on this approach 
uses universal algebra to define the set of operations that describe an element 
in terms of its use or purpose. The method adopts a mathematical device called 
homomorphism to compare the algebras of different elements and determine 
whether they are similar [19]. 

A limitation of the representational methods is their reliance on the ways 
individuals model their data, or the methods they attach to their elements, both 
of which will depend on their individual semantics. This approach does not take 
element semantics into account, meaning that elements that are semantically 
very different may be considered equivalent. 

2.2 Semantic Methods 

Two main types of semantic methods have been suggested: those based on termi- 
nological relationships and those based on definitions. The former use networks 
or hierarchies that indicate the relationships (including synonyms and genera- 
lizations) between either the names given to database elements [14] or the terms 
used in element definitions [50]. There is usually a common taxonomy of terms 
that is used by all those involved in data sharing. The terms used in the local 
databases are mapped to the common taxonomy, and the links between terms 
from different databases can be used to determine their semantic similarity [6], 
[13], [3]. A variation requires the user to define semantic clusters across all databases, 
which are then used to automatically generate an associative network [42]. 

The element definitions method involves a simple comparison of the defini- 
tions of the elements being compared. This method has been applied to spatial 
information, using the classification criteria for an element as its definition. Ho- 
wever, these definitions are used to resolve schematic heterogeneity rather than 
determine semantic equivalence [3]. 

Another method combines terminological relationships and element defini- 
tions in a knowledge base. Other types of information like organizational rules 
may also be included [50], [8], [6], [15]. 

These semantic methods usually assume that a common language is used 
(that is, a given term has the same meaning across all databases). Although 
a mapping between the individual’s terminology and that of the common language 
is possible, this only allows a direct one-to-one mapping between terms, and 
does not consider the more subtle differences in the classification criteria used 
for terms by different individuals. 

These requirements are limiting for information that has a heterogeneous 
nature, including spatial information. These limitations are supported by Mark’s 
findings that linguistic labels are limited in their ability to convey semantics [24], 
and Kuhn’s comments about the inadequacy of terms for the representation of 
meaning [19]. 

3 The Inclusion Rules Method 

The inclusion rules method aims to take advantage of the strengths of the se- 
mantic methods, but reduce the need for a common language that limits the 
application of the semantic methods to information that is used by a number of 
different information communities (like spatial information). The method does 
this by applying a cognitive science theory of concept attainment to the pro- 
blem. This section provides a theoretical background to the method, and gives 
a formal definition of relevant elements. 

3.1 Theoretical Background 

Psychological theory asserts that individuals organize the world in terms of ca- 
tegories of things that have similar characteristics or attributes, and that these 
categories make up the semantics of real world entities from the individual’s per- 
spective. The characteristics are selected according to underlying theories that 
individuals hold about which characteristics are important, and hence may differ 
from one individual to another [28]. 

Psychological research has dedicated much effort to the question of how categories 
(more commonly referred to as concepts) are learnt, and then how 
they are used to classify newly experienced real world entities. These theories 
are important for research into database integration, because they provide an 
indication of how concepts differ between individuals, and following on from 
this, how it might be possible to translate between concepts held by different 
individuals so that valid integration is possible. 

A number of psychological theories have been suggested in terms of how 
concepts are attained, many having a number of similarities. The database in- 
tegration method developed in this research is based on the theory of concept 
attainment suggested by Klausmeier, Ghatala and Frayer [18]. This theory, or 
model, is based on several years of experimentation, as well as work by other 
eminent psychological researchers (including Piaget and Inhelder [36], who have 
carried out a series of authoritative studies in this area) [18]. 

Klausmeier et al’s model involves a series of levels of concept attainment: 
the concrete level, the identity level, the classificatory level and the formal level. 
The concrete level is similar to perception, and is reached when the individual 
is aware of an entity. The identity level is reached when the individual sees the 
same entity in different situations, positions or orientations, and realizes that 
it is the same entity. The classificatory level is reached when the individual 
groups entities together based on some similarity, but is not able to formalize 
the criteria for this grouping. Finally, at the formal level, the individual is able to 
define the group of characteristics that define a concept, and to identify inclusion 
and exclusion rules in terms of those characteristics [18]. 

Klausmeier et al’s theory suggests that concepts change over time, and may 
be affected by a number of different external factors. A possible criticism of the 
application of the theory to database semantic representation is that the 
theory applies to the concepts of individuals; but that database elements are 
more likely to represent the concepts of a group as defined during the database 
analysis and design process. It is suggested that there is unlikely to be an 
incompatibility between an individual database user’s concepts and the concepts 
of the database itself for the following reasons: 

— the database concepts are developed from the concepts of the database users, 
and since those users have similar experiences with the data and operate in 
a similar environment, they will probably have similar semantics for the 
database elements they share [29] and 

— it would be difficult for a user to regularly use a database over a period of 
time that contained concepts that were incompatible with her own without 
gradually changing her own semantics, or consciously performing a mental 
translation whenever she used the database. 

Since the database concepts are likely to be similar to the concepts of its users, 
it is appropriate to adopt the individual theory for application to the wider 
database. 



3.2 A Formal Language for the Inclusion Rules Method 

In accordance with Klausmeier et al’s theory of concept attainment, the inclu- 
sion rules method defines the semantics of database elements (which are the 
implementation of concepts) in terms of formal level rules that the representa- 
tion of a real world entity must satisfy in order to be considered an instance of 
the element. These rules are combined into predicates that express the meaning 
of elements such that each database element is represented by a single predi- 
cate. When elements are defined in this manner, the inclusion rules can be used 
to determine the similarities and differences between the elements in different 
databases. 

For the purposes of defining and illustrating the inclusion rules method, a 
formal language has been adopted. The formal language has two separate parts. 
The first part describes how inclusion rules are specified in terms of dimensions 
and properties. These rules form the foundation for the second part, which de- 
scribes how rules can be combined into predicates to define element semantics. 

For the purposes of this definition, the standard syntactic metalanguage is 
adopted [49]. Using this metalanguage, the “=” symbol defines a name with 
its appropriate grammatical structure or well formed formula. A “;” symbol 
terminates the definition. A “|” symbol indicates alternative choices for the 
make up of a well formed formula, and any symbol or variable from the language 
being specified is enclosed in quotation marks. 



Rule Definition Klausmeier et al’s model specifies that a concept is attained 
at the formal level when a person can specify the rules that she uses to include 
or exclude a newly experienced entity from the concept. These rules consist of 
what Klausmeier et al [18], as well as other researchers (for example, [4]) refer 
to as dimensions and properties. For example, color is a dimension, and red is a 
property of that dimension in Klausmeier et al’s language [18]. 

In the formal language of the inclusion rules method, the following variables 
represent the principles discussed above: 

— Di represents a dimension; 

— Pj represents a property (which may be a single value, a range of values or 
an enumerated set of values) and 

— Rk represents a rule. 

A symbol [ ] can be interpreted as ‘is a property of’. 

The syntactic rule for well formed formulas for specifying rules in the formal 
language is as follows: 

rule = “Di” “[” “Pj” | element definition “]”; (1) 

(element definition will be defined in the next section). 

As discussed in the previous section, one of the aims of the method is to 
remove the requirement for users to have a common language. The inclusion rules 
method does this in part by defining database elements in terms of inclusion rules 
that have a standard form (a dimension and property). However, some reliance 
on language is unavoidable because the dimensions and properties are expressed 
in language. The method attempts to minimize this reliance in four ways. 

Firstly, element semantics are represented using a standard form that removes 
some of the ambiguities of natural language. Users are confined to representation 
of rules as dimensions and properties, and to combining these rules using the ∧ 
and ∨ operators. In this way, some of the possible ambiguity is removed. 

Secondly, inclusion rules may include not only direct values, but also refe- 
rences to other elements. The property of an element can itself be a predicate. 
The implementation of this facility is described in more detail in 4.2. 

Thirdly, a standard set of dimensions is being developed, using research into 
semantic theory. Although individuals differ in the dimensions and properties 
that they use to define concepts, it is thought that a fundamental set of dimensions 
and properties is used as a source of these definitions. Relationships 
between the dimensions can be defined to handle redundancy, asymmetry and 
context [22]. 

Fourthly, attempts should be made to avoid assumptions about the meanings 
of terms. Dimensions and properties should be defined in the tradition of scien- 
tific research [44]. In this sense, the ability of the method to handle different 
semantics will depend on the specificity of the inclusion rules. 



Element Definition Element definitions are built up using combinations of 
rules, so no new variables are necessary. Two additional operators are used: 

— ‘∧’, which can be interpreted as ‘and’ (conjunction), meaning that the element 
is defined by both the operands and 

— ‘∨’, which can be interpreted as ‘or’ (disjunction), meaning that the element 
is defined by either or both of the operands. 

The inclusion rules method does not use the negation operator. The reason 
for this omission is that the use of negation in databases causes logical problems. 
If an element is defined in terms of negative rules (for example the element is 
not green in color), this implies that all the properties that the element does not 
have must be specified for the definition to be complete. This would include an 
infinite list of negative properties, and is clearly impossible. For this reason, it is 
customary to adopt the closed world assumption. This assumes that if a piece of 
information is not in the database (in this case, a rule is not in the definition), 
it is considered to be false. In other words, if it is said that an element is red or 
yellow, the fact that it is not green, blue or any other color is implied, and the 
need to explicate these facts is avoided [37], [1]. 

Well formed formulas for element definition must conform to the following 
pattern: 

element = rule | element ∧ element | element ∨ element; (2) 

Element definitions are referred to as predicates for the remainder of this paper. 
Each of these definitions has an implied ∀x | x ∈ element (element definition), 
to indicate that each instance of the database element evaluates to true for each 
rule in the expression. 
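
The grammar in (1) and (2) can be illustrated with a minimal sketch. It assumes, purely for illustration, that an instance is a mapping from dimension names to values and that a property is a single value, an enumerated set, or a nested predicate. The class names `Rule`, `And`, `Or` and the function `satisfies` are hypothetical and are not part of the method as published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """D[P]: the instance's value on `dimension` must match `prop`."""
    dimension: str
    prop: object  # single value, frozenset of allowed values, or a predicate


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def satisfies(instance: dict, predicate) -> bool:
    """Return True if `instance` evaluates to true for the predicate."""
    if isinstance(predicate, Rule):
        # Closed world: a dimension that is not recorded counts as false.
        if predicate.dimension not in instance:
            return False
        value = instance[predicate.dimension]
        if isinstance(predicate.prop, (Rule, And, Or)):
            # The property is itself an element definition.
            return isinstance(value, dict) and satisfies(value, predicate.prop)
        if isinstance(predicate.prop, frozenset):
            return value in predicate.prop
        return value == predicate.prop
    if isinstance(predicate, And):
        return satisfies(instance, predicate.left) and satisfies(instance, predicate.right)
    if isinstance(predicate, Or):
        return satisfies(instance, predicate.left) or satisfies(instance, predicate.right)
    raise TypeError(f"not a predicate: {predicate!r}")


# An element that is (red or yellow) and square.
colour = Rule("color", frozenset({"red", "yellow"}))
shape = Rule("shape", "square")
element = And(colour, shape)
```

No negation operator is needed in this sketch: an instance that is green simply fails the color rule.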

4 An Example 

The inclusion rules method provides a means for undertaking each step in the 
database integration process. Firstly, the method can be used by participants 
in data sharing exercises to represent the semantics of the elements in their 
databases. Following this, the method can be used to determine the semantic 
equivalence of elements, and thus whether they should be integrated or translated 
between. Thirdly, the information provided by the method can be used to resolve 
any schematic or syntactic heterogeneity between the elements that are to be 
integrated, and to merge the databases into one. This paper is concerned with 
the first two steps in the process, as described in this section with a running 
example using two relational databases. A summary of the third stage in the 
process is provided in [47]. 

4.1 The Participating Databases 

Figures 1 and 2 contain Entity Relationship diagrams for two example relational 
schemas, together with the relations produced from the diagrams. 

The elements in Schema 1 are defined as follows: 

Relations: 

— Block: a physical area of the surface of the earth, defined both in position 
and extent, with a bundle of property rights attached that are owned by a 
particular person. A block must exist either wholly or partly within a larger 
administrative area. 

Fig. 1. ER Diagram and Relational Model for Schema 1 

Fig. 2. ER Diagram and Relational Model for Schema 2 

Attributes: 

— Block_ID: a unique, arbitrarily defined identifier for a block. 

— Owner: the person who has the right to use a particular block in accordance 
with the bundle of rights assigned to the relationship between the block and 
the person. The owner may only be a single individual. 

— DP_No: the number of the plan that created the block and has been deposited 
with and registered by the jurisdiction’s land administration authority. 

Spatial Elements: 

— Block_Polygon: the physical location and extents of a block, captured using 
survey plan data entry, using an Australian Map Grid coordinate system 
defined in meters (Transverse Mercator projection) with an accuracy of +/- 
0.2m, oriented to grid north. 

The elements in Schema 2 are defined as follows: 

Relations: 

— Parcel: a physical area of the surface of the earth, defined both in position 
and extent, with a bundle of property rights attached that are owned by one 
or several people. A parcel must exist wholly within a larger administrative 
area. 

— Owner: the person who has the right to use a particular parcel in accordance 
with the bundle of rights assigned to the relationship between the parcel and 
the person. 

— Parcel_Owner: the assignment of a particular parcel to the ownership of a 
particular person (each parcel may have any number of owners, and each 
owner may own any number of parcels). 

Attributes: 

— Parcel_ID: a unique, arbitrarily defined identifier for a parcel. 

— Plan_No: the number of the plan that created the parcel and has been regi- 
stered by the jurisdiction’s land administration authority. 

— Owner_Name: the name of the person who has the right to use a particular 
parcel in accordance with the bundle of rights assigned to the relationship 
between the parcel and the person. 

— Owner_Address: the street name, number, locality and city of the owner. 

Spatial Elements: 

— Parcel_Polygon: the physical location and extents of a parcel, captured by 
digitizing a 1:1000 topographical map, using a local coordinate system defined 
in feet (with a Lambert Conformal Conic projection) with an accuracy 
of +/- 10m, oriented to true north. 

4.2 Representing Database Elements 

Before attempts can be made to determine the semantic similarity of database 
elements, a predicate logic expression must be defined for each element in the 
participating databases. These expressions consist of rules, combined with the 
∧ and ∨ operators. Complex expressions can be built up in the usual predicate 
logic manner. 

The rules themselves are contained in a rule repository, which consists of 
three tables of a relational nature. 

1. The Rule Table contains the number of the rule, which is simply the sub- 
script to the “R” variable that represents each rule, and the numbers of the 
dimension and property that make up that rule. 

2. The Dimension Table contains the number of the dimension, which is simply 
the subscript to the “D” variable that represents each dimension, and a 
phrase that defines the dimension. 

3. The Property Table contains the number of the property, which is simply 
the subscript to the “P” variable that represents each property, and a data 
value, range of data values or enumerated set of data values that define the 
property (a phrase, number, date etc.), or a predicate logic expression that 
defines an element that represents the property. 

For each element in each of the participating databases, the owner of the da- 
tabase must carry out the following steps to define the semantics of the elements 
in the database: 

1. Define a set of rules for the element that dictate whether a newly experi- 
enced entity is considered to be an example of that element (that is, define 
the inclusion rules for the database element), and create an expression to 
combine the different rules. 

2. For each rule: 

— Check whether the dimension is already in the repository. If it isn’t, add 
the dimension to the repository. 

— Check whether the appropriate property value or range is already in the 
repository. 

— If both the dimension and the property are already in the repository, 
check to see if there is a rule that matches the two. If there is, place 
the number of the rule in the predicate. If there isn’t, create a new 
rule that references the appropriate dimension and property, and 
place its number in the predicate. 

— If either the dimension or the property (or both) is not in the 
repository, add them and create a new rule that references the ap- 
propriate dimension and property. Place the number of the new rule 
in the predicate. 
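
The look-up-or-add procedure above can be sketched as follows. This is a minimal illustration that assumes each of the three repository tables can be held as an in-memory mapping from a number to its content; the class `RuleRepository` and its method names are hypothetical, and versioning is not modelled.

```python
class RuleRepository:
    """Three tables: dimensions, properties, and rules that pair them."""

    def __init__(self):
        self.dimensions = {}  # dimension number -> defining phrase
        self.properties = {}  # property number -> value, range or set
        self.rules = {}       # rule number -> (dimension number, property number)

    @staticmethod
    def _find_or_add(table, content):
        """Return the number of `content`, adding a new row if it is absent."""
        for number, existing in table.items():
            if existing == content:
                return number
        number = len(table) + 1
        table[number] = content
        return number

    def rule_number(self, dimension, prop):
        """Return the number of the rule D[P], creating any missing rows."""
        dim_no = self._find_or_add(self.dimensions, dimension)
        prop_no = self._find_or_add(self.properties, prop)
        return self._find_or_add(self.rules, (dim_no, prop_no))


repo = RuleRepository()
first = repo.rule_number("color", "red")
second = repo.rule_number("color", "yellow")
again = repo.rule_number("color", "red")  # reuses the existing rule
```

The numbers returned by `rule_number` are what the database owner would place in the predicate for the element.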

An important restriction on the addition of properties to the repository is 
that circular references must be avoided. Since properties can contain references 
to other rules, a check must be undertaken to avoid creating logic inconsistencies 
in the rule definitions. In addition, a constraint on predicates is that they must 
not contain more than one rule that references any given dimension, as this 
would indicate a logical contradiction. If an element may have one of several 
values for a property, these can be included as a property range or enumerated 
set of values. 
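
Both restrictions lend themselves to simple mechanical checks. The sketch below assumes that a predicate is given as a collection of rule numbers, that the rule table maps a rule number to a (dimension number, property number) pair, and that references between elements are recorded as a mapping from an element name to the names it refers to. The function names and data layout are illustrative assumptions only.

```python
def repeated_dimensions(rule_numbers, rule_table):
    """Return the dimensions referenced by more than one rule of a predicate."""
    seen, repeated = set(), set()
    for number in rule_numbers:
        dimension = rule_table[number][0]
        if dimension in seen:
            repeated.add(dimension)
        seen.add(dimension)
    return repeated


def has_circular_reference(element, references, path=()):
    """Depth-first search for a reference chain that returns to itself."""
    if element in path:
        return True
    return any(
        has_circular_reference(target, references, path + (element,))
        for target in references.get(element, ())
    )


# Hypothetical rule table: rules 1 and 2 both use dimension 1.
rule_table = {1: (1, 1), 2: (1, 2), 3: (2, 3)}
acyclic = {"Owner_Address": {"Owner"}, "Owner": set()}
cyclic = {"A": {"B"}, "B": {"A"}}
```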

Since the repository is extensible by data sharers, it will have a number of 
versions, and version numbers will need to be maintained. For each database 
element, the data sharer creates a predicate that references the rules in the 
repository. In order for database integration to be successful, it is important 
that participants in a data sharing exercise either use the same version of the 
repository, or only reference elements that are contained in the versions that are 
being used by other participants. As the repository becomes large, the need to 
extend the repository will decrease, and changes will be less likely, thus at this 
later stage it is more likely that database integration can occur successfully even 
if the versions used by the participants are slightly different. 

Appendix A contains the three rule repository tables for the example 
introduced in Section 4.1. The predicates for each element in the schemas are 
as follows: 

Schema 1: 

— Block = R1 ∧ R2 ∧ R3 ∧ R4 

— Owner = Rs ∧ Rg 

— DP_No = R7 

— Block_Polygon = Rs ∧ Rg ∧ Rig ∧ Ru ∧ Rig 

Schema 2: 

— Parcel = R1 ∧ Rg ∧ Rg ∧ Rig 

— Owner = Rg ∧ Rg 

— Plan_No = R7 

— Owner_Name = Rg ∧ Rg 

— Owner_Address = Ru ∧ Rig ∧ Rig 

— Parcel_Polygon = R17 ∧ Rg ∧ Ris ∧ Rig ∧ Rgg ∧ Rgi ∧ Rgg ∧ Rgg 

Although most of the properties listed in this example contain single values, 
there are some properties that include predicates as references to other elements. 
For example, rules 5 and 17 both include a dimension that provides a reference to 
the semantic aspects of the spatial element being defined. Rule 16 also includes 
a reference to the owner of a block as a means of indicating the importance 
of an address. In a real life situation, many more of these references would be 
used to avoid undefined terminology (for example, street, residential area and 
administrative area would be defined with predicates). 

The use of the ∨ operator is relatively unusual. This is because definitions of 
elements do not often rely on two alternatives that refer to entirely different 
dimensions. It is more common for alternatives to refer to a single dimension, in 
which case they can be included as an enumerated set (for example, see Rule 4). 

4.3 Identifying Semantic Similarity 

Up to this point, the aim of identifying the semantic equivalence of elements 
has been defined as identifying database elements that represent the same real 
world entity. In the case where two database elements represent exactly the same 
real world entity (in terms of the users’ semantics of that entity), they should 
be represented by the same predicate. However, it is often the case that two 
elements may have slightly different semantics, but the user would still like to 
integrate them. Thus for the purpose of schema integration, semantic similarity is 
defined by the integrator and the purpose of the integration. If slight differences 
are of no consequence for the purpose to which the data will be put, semantic 
similarity will have a more generous definition than for other purposes. 

Using the inclusion rules method, it is possible to integrate any two database 
elements. This integration creates a database element with the semantics of the 
rules that the two integrated predicates share. If they do not share any rules, the 
created element will be infinitely general (that is, it will contain any real world 
entity). The more rules the two elements share the more specific the integrated 
element will be. The level of generality of the integrated elements will depend on 
the user’s requirements, so the inclusion rules method offers user input to allow 
control over the process. 
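
For purely conjunctive predicates, the integrated element described above amounts to a set intersection. The sketch below assumes that each predicate is a set of (dimension, property) pairs and that an entity is a mapping from dimensions to values; the names `integrate` and `is_instance`, and the sample rules, are hypothetical.

```python
def integrate(predicate_a, predicate_b):
    """Return the rules that both conjunctive predicates share."""
    return set(predicate_a) & set(predicate_b)


def is_instance(entity, predicate):
    """An entity belongs to the element if it satisfies every rule."""
    return all(entity.get(dimension) == prop for dimension, prop in predicate)


block = {("kind", "land area"), ("owners", "one")}
parcel = {("kind", "land area"), ("owners", "one or more")}
merged = integrate(block, parcel)                       # one shared rule
unrelated = integrate(block, {("kind", "street name")})  # no shared rules
```

An empty result, as in `unrelated`, accepts every entity, which corresponds to the infinitely general element mentioned in the text.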

Each element in one database must be compared to each element in the 
other database in order to determine whether the two elements are semantically 
equivalent within the tolerances of the human integrator. This comparison relies 
on the predicates that represent the elements as a representation of element 
semantics. For the remainder of this discussion, the two predicates are referred 
to as Pi and Pj. 

The first step in comparison of the predicates is to convert them to conjunc- 
tive normal form. Conjunctive normal form (CNF) is a form where an expression 
is the conjunction of a number of conjuncts. Each of these conjuncts is a single 
variable or a disjunction of disjuncts, each of which is a single variable. Every 
expression has an equivalent in CNF. [23] discusses methods for converting pre- 
dicate expressions into CNF. Once Pi and Pj are in CNF, they can be directly 
compared. Each rule in Pi should be classified as one of the following: 

— a shared conjunct, being a conjunct, whether a single rule or a disjunction, 
that also appears in Pj; 

— a shared disjunct, being a disjunct of a disjunction that is contained in either 
a conjunct or a disjunction in Pj; 

— an unshared conjunct, being a conjunct, whether a single rule or a disjunc- 
tion, that does not appear in Pj or 

— an unshared disjunct, being a disjunct of a disjunction that is not contained 
in either a conjunct or a disjunction in Pj. 

This classification is used to determine how similar two elements are seman- 
tically. Shared conjuncts are considered a stronger indication of similarity than 
shared disjuncts, because they are compulsory rather than optional. For exam- 
ple, an element that is (red and square) is more similar to an element that is (red 
and round) than it is to an element that is ((red or old) and round). Similarly, 
unshared conjuncts are considered a stronger indication of difference than are 
unshared disjuncts. 
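
The conversion and classification steps can be sketched as follows, using the red, square, round and old example. Several assumptions are made that the text leaves open: rules are identified by integers, expressions are nested tuples of the form ("and", a, b) or ("or", a, b), a disjunction that does not appear whole in the other predicate is broken into its individual disjuncts, and rules are matched by number only (the more detailed comparison of dimensions and properties is not modelled). The function names are illustrative.

```python
from itertools import product


def to_cnf(expr):
    """Convert a negation-free expression to CNF: a set of clauses,
    each clause being a frozenset of rule numbers (a disjunction)."""
    if isinstance(expr, int):
        return {frozenset([expr])}
    op, left, right = expr
    left_cnf, right_cnf = to_cnf(left), to_cnf(right)
    if op == "and":
        return left_cnf | right_cnf
    if op == "or":
        # Distribute "or" over "and": every pairing of clauses is merged.
        return {a | b for a, b in product(left_cnf, right_cnf)}
    raise ValueError(f"unknown operator: {op!r}")


def classify(p_i, p_j):
    """Count shared and unshared conjuncts and disjuncts of p_i against p_j."""
    rules_in_j = set().union(*p_j)
    counts = {"shared_conjuncts": 0, "shared_disjuncts": 0,
              "unshared_conjuncts": 0, "unshared_disjuncts": 0}
    for clause in p_i:
        if clause in p_j:
            counts["shared_conjuncts"] += 1
        elif len(clause) == 1:
            counts["unshared_conjuncts"] += 1
        else:
            for rule in clause:
                key = "shared_disjuncts" if rule in rules_in_j else "unshared_disjuncts"
                counts[key] += 1
    return counts


RED, SQUARE, ROUND, OLD = 1, 2, 3, 4
red_square = to_cnf(("and", RED, SQUARE))
red_round = to_cnf(("and", RED, ROUND))
red_or_old_round = to_cnf(("and", ("or", RED, OLD), ROUND))
```

Under these assumptions, `classify(red_square, red_round)` finds one shared conjunct, whereas `classify(red_square, red_or_old_round)` finds none, which agrees with the ordering of similarity given in the example.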

The classification method described above requires that a judgment is made 
regarding similarity or difference of two rules. If two rules have the same number, 
they are obviously the same, but if they have different numbers, they may still 
be similar. Thus any rules that do not have the same number must be examined 
in more detail to determine their possible similarity. A first step in this more 
detailed examination involves comparison of the dimensions. If these dimensions 
are different, the rules are also different, so they can be classified accordingly. If 
the dimensions are the same (but the rule number different), the property values 
must be compared. 

The compared property values will not be identical (because this would mean 
that the rule repository contains identical rules), but the values in one rule may 
share some common values with the values in the other rule. In this case, the 
decision regarding whether the rules should be classified as similar (and thus 
shared) or different (and thus unshared) depends on the property type and the 
rule comparison strategy. The rule comparison strategy can be biased towards 
similarity or difference, depending on the requirements of the integration or 
translation exercise. The similarity bias assumes similarity if the relationship is 
unknown and requires only an overlap between property values in order for two 
rules to be considered similar. In contrast, the difference bias assumes difference 
if the relationship is unknown, and requires one value to be a subset of the other 
for two rules to be considered similar. 
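
The two comparison strategies can be illustrated with a small sketch. It assumes that property values are represented as sets of discrete values, which is only one of the property types the method allows, and it does not model the case where the relationship between values is unknown. The function name `rules_similar` and the `bias` parameter are hypothetical.

```python
def rules_similar(dim_a, values_a, dim_b, values_b, bias="similarity"):
    """Decide whether two rules with different numbers count as shared."""
    if dim_a != dim_b:
        return False  # different dimensions: the rules are different
    a, b = set(values_a), set(values_b)
    if bias == "similarity":
        return bool(a & b)        # any overlap is sufficient
    if bias == "difference":
        return a <= b or b <= a   # one value set must contain the other
    raise ValueError(f"unknown bias: {bias!r}")


overlap_only = ("color", {"red", "yellow"}, "color", {"red", "green"})
subset = ("color", {"red"}, "color", {"red", "yellow"})
```

With `overlap_only`, the similarity bias treats the rules as shared while the difference bias does not; with `subset`, both strategies treat them as shared.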

The classification process is carried out in both directions. That is, Pi is
compared to Pj, and then Pj is compared to Pi.

The classification discussed above is used in order to determine the degree 
of similarity between two elements by calculating a ratio of similarity to diffe- 
rence between two predicates. This is based on the principle that more similar 
elements will probably have more rules in common. A calculated value termed 
the comparison ratio is used to encapsulate this similarity. The comparison ratio 
(CR) is a simple ratio relating the number of rules that the two element predica- 
tes share to the number of rules by which they differ. The shared and unshared 
disjuncts are given a lesser value than the shared and unshared conjuncts for 
reasons discussed above. A value of about half may be appropriate, but this may 
also be set by the user: 

$$\mathrm{CR} = \left(\text{number of shared conjuncts} + \frac{\text{number of shared disjuncts}}{2}\right) : \left(\text{number of different conjuncts} + \frac{\text{number of different disjuncts}}{2}\right)$$

for either the Pi to Pj comparison or the Pj to Pi comparison, whichever is larger.

This definition of the CR assumes that rules are of equal weight. If the rules
by which the two predicates differ are much more important than those that they
share, the elements represented by the predicates may seem much more similar
than they really are. For this reason, the inclusion rules method allows the
definer of an element’s semantics (in the form of a predicate) to assign weights to
the different rules in the predicate. If such weights are assigned, the definition
of CR becomes:

$$\mathrm{CR} = \left(\sum \text{weights of shared conjuncts} + \frac{\sum \text{weights of shared disjuncts}}{2}\right) : \left(\sum \text{weights of different conjuncts} + \frac{\sum \text{weights of different disjuncts}}{2}\right)$$

for either the Pi to Pj comparison or the Pj to Pi comparison, whichever is
larger.
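
A minimal sketch of the CR calculation for one comparison direction follows. It assumes that the second part of the ratio combines different conjuncts with half-valued different disjuncts, mirroring the first part; the function name `comparison_ratio` and its parameters are hypothetical. Choosing between the two comparison directions is left to the caller.

```python
def comparison_ratio(shared_conjuncts, shared_disjuncts,
                     different_conjuncts, different_disjuncts,
                     disjunct_factor=0.5):
    """Return the comparison ratio as a (similar, different) pair.

    Each argument is a list of rule weights; pass 1 for every rule to get
    the unweighted CR. disjunct_factor is the user-adjustable value given
    to disjuncts relative to conjuncts (about half in the text).
    """
    similar = sum(shared_conjuncts) + disjunct_factor * sum(shared_disjuncts)
    different = (sum(different_conjuncts)
                 + disjunct_factor * sum(different_disjuncts))
    return similar, different


# Unweighted: three shared conjuncts against one different conjunct.
print(comparison_ratio([1, 1, 1], [], [1], []))    # (3.0, 1.0)
# One shared conjunct and two shared disjuncts against one different disjunct.
print(comparison_ratio([1], [1, 1], [], [1]))      # (2.0, 0.5)
```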

CR values can be interpreted as follows: 

— CR = 1:1 - the similarities and differences are approximately equal; 

— the first value in the ratio is much larger than the second value - the elements 
represented by the predicates are similar and 

— the first value in the ratio is much smaller than the second value - the 
elements represented by the predicates are not very similar. 

The CR values for each pair of elements can be used to automatically determine
the semantic equivalence of elements. The user can specify a similarity
threshold, and if the ratio is greater than the threshold value, the elements
concerned are considered semantically similar, and are integrated or translated
between. For example, if a user specifies a similarity threshold of 1:1, any pairs
of elements that have more similar rules than different rules will be considered
semantically equivalent (if weights are attached, this may not necessarily mean
that the number of similar rules will be greater than the number of different
rules).



Table 1. CRs for Element Pairs

Schema 1 (top), Schema 2 (side)   Block   Owner   DP_No   Block_Polygon
Parcel                            3:1     0:4     0:4     0:5
Owner                             0:4     2:0     0:2     0:5
Plan_No                           0:4     0:2     1:0     0:5
Owner_Name                        0:4     2:0     0:2     0:5
Owner_Address                     0:4     0:3     0:3     0:5
Parcel_Polygon                    0:8     0:8     0:8     2:6


The CRs for the pairs of elements in Schema 1 and 2 are shown in Table 1.
Using a similarity threshold of 1:1, the set of semantically equivalent elements is
as follows: {1:Block, 2:Parcel} {1:Owner, 2:Owner} {1:Owner, 2:Owner_Name}
{1:DP_No, 2:Plan_No}
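
The selection of equivalent pairs can be reproduced with a short sketch. The CR values are copied from Table 1, with element names written with underscores; the function names are hypothetical. Two choices here are assumptions rather than part of the method: ratios are compared by cross-multiplication so that a zero second value needs no division, and a ratio equal to the threshold counts as meeting it.

```python
# CRs from Table 1, keyed by (Schema 1 element, Schema 2 element).
TABLE_1 = {
    ("Block", "Parcel"): (3, 1), ("Owner", "Parcel"): (0, 4),
    ("DP_No", "Parcel"): (0, 4), ("Block_Polygon", "Parcel"): (0, 5),
    ("Block", "Owner"): (0, 4), ("Owner", "Owner"): (2, 0),
    ("DP_No", "Owner"): (0, 2), ("Block_Polygon", "Owner"): (0, 5),
    ("Block", "Plan_No"): (0, 4), ("Owner", "Plan_No"): (0, 2),
    ("DP_No", "Plan_No"): (1, 0), ("Block_Polygon", "Plan_No"): (0, 5),
    ("Block", "Owner_Name"): (0, 4), ("Owner", "Owner_Name"): (2, 0),
    ("DP_No", "Owner_Name"): (0, 2), ("Block_Polygon", "Owner_Name"): (0, 5),
    ("Block", "Owner_Address"): (0, 4), ("Owner", "Owner_Address"): (0, 3),
    ("DP_No", "Owner_Address"): (0, 3),
    ("Block_Polygon", "Owner_Address"): (0, 5),
    ("Block", "Parcel_Polygon"): (0, 8), ("Owner", "Parcel_Polygon"): (0, 8),
    ("DP_No", "Parcel_Polygon"): (0, 8),
    ("Block_Polygon", "Parcel_Polygon"): (2, 6),
}


def meets_threshold(cr, threshold=(1, 1)):
    """True if the ratio cr = (similar, different) reaches the threshold."""
    similar, different = cr
    t_similar, t_different = threshold
    if similar == 0 and different == 0:
        return False                      # no evidence either way
    # similar/different >= t_similar/t_different, without dividing by zero
    return similar * t_different >= different * t_similar


def equivalent_pairs(table, threshold=(1, 1)):
    """Return the element pairs whose CR reaches the threshold, sorted."""
    return sorted(pair for pair, cr in table.items()
                  if meets_threshold(cr, threshold))


print(equivalent_pairs(TABLE_1))
```

With the 1:1 threshold this yields the four pairs listed above; the polygon pair, at 2:6, is not selected.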

The case of the polygons is one that might benefit from the application of
weights. Although the polygons differ in a number of representational details, the
polygons are linked to elements in their respective schemas that are considered
semantically equivalent (that is, Block and Parcel). Thus it may be appropriate
to give the rule that results from the combination of R8 and R17 a greater
weight. If this rule is given a weight of 5, a CR of 6:6 for the Block_Polygon -
Parcel_Polygon pair is the result. This CR is within the similarity threshold, so
that pair will be added to the set of semantically equivalent elements.
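
The arithmetic behind this result can be written out as follows, assuming (as the 2:6 entry in Table 1 suggests) that the unweighted ratio comes from two shared and six different conjuncts of weight 1 each, and that the weight of 5 is applied to one of the two shared rules:

$$\mathrm{CR} = (5 + 1) : (1 + 1 + 1 + 1 + 1 + 1) = 6 : 6$$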

At this stage, the set of semantically equivalent elements that has been de- 
termined by the process described in this section can be presented to the user for 
alteration or confirmation as desired. This represents a significant reduction in 
effort relative to the entirely manual method that has sometimes been used. The 
attention of the user can be directed to particular data sets for confirmation. 

From this point, the conflicts between the semantically equivalent elements 
must be resolved, and then elements can be merged. Details of this process are 
provided in [47]. 

5 Conclusion 

The inclusion rules method as described above provides a means for the repre- 
sentation of the semantics of database elements. Although this method is not 
completely isolated from the language and semantics of individuals, the method 
reduces reliance on assumptions of similarity by incorporating aspects of Klaus- 
meier et al’s [18] model of concept attainment. This includes specification of 
dimensions and properties that are used to define a concept, as well as the types 
of relationships that exist between concepts. 

Despite the persistent, if reduced, level of reliance on language, the method 
is considered useful in providing a formalized approach for expressing not only 
the differences between elements, but also the specific ways in which they differ 
in terms of dimensions and properties. This information can then be used to 
identify semantically similar database elements with a reduced requirement for 
human interaction. 

Ongoing research is testing the use of the method in real world situations 
to determine its practicality, as well as its ability to handle the wide range of 
semantic heterogeneity that exists in the spatial user community. 

The example provided in Section 4 indicates that the inclusion rules method 
is capable of identifying semantically similar elements that contain variations in 
database representation. This suggests that the method may be a useful tool 
for database integration and semantic translation of spatial data from different 
information communities.


Appendix: Example Rule Repository



Table 2. Rule Table

Rule Number   Dimension Number   Property Number
1             1                  1
2             2                  2
3             3                  3
4             4                  4
5             2                  5
6             3                  6
7             2                  7
8             5                  8
9             6                  9
10            7                  10
11            8                  11
12            9                  12
13            4                  13
14            2                  14
15            10                 15
16            11                 16
17            5                  17
18            7                  18
19            8                  19
20            12                 20
21            13                 21
22            14                 22
23            9                  23



Table 3. Dimension Table

Dimension Number   Dimension
1                  Main object ingredient
2                  Object purpose
3                  Source of authority
4                  Spatial relationship
5                  Object defined by spatial object
6                  Spatial object type
7                  Capture method
8                  Coordinate system
9                  Accuracy
10                 Format
11                 Allows communication with
12                 Measurement unit
13                 Projection
14                 Orientation



Table 4. Property Table

Property Number   Description
1                 Earth
2                 Definition of the position and extents over assigned land use rights
3                 The legally recognized land registration authority for the jurisdiction
4                 Wholly or partly within a larger administrative area
5                 Reference to the rights of land holder to use land in accordance with
                  specified relationships
6                 Legal name as shown on a birth certificate or similar document
7                 Recording of the legal survey that creates and defines a unit of earth
                  in diagrammatic form
8                 R1 ∧ R2 ∧ R3 ∧ R4
9                 Polygon
10                Survey plan data entry
11                Australian Map Grid
12                +/- 0.2m
13                Wholly within a larger administrative area
14                Facilitation of postal communication
15                Street number, street name, residential area
16                R5 ∧ R6
17                R1 ∧ R2 ∧ R3 ∧ R13
18                Digitizing from 1:1000 topographic map
19                Local
20                Feet
21                Lambert Conformal Conic
22                True North
23                +/- 10m



Abstract. The paper discusses issues related to the design of an integrating
mediator that facilitates the use of distributed and heterogeneous
data resources. The architecture for such a mediator is outlined in the
paper. The implementation of relationships between distributed objects
is presented. The design of a mediator is based on the CORBA software
model that considers relationships between distributed objects as
first-class objects with state and operations rather than as stored links.



1 Introduction 

High-speed networks facilitate the exchange of data and services between infor- 
mation systems, and developments in distributed software architectures make 
it possible for heterogeneous software systems to interoperate. Complementary 
advances should be obtained in the field of data integration (not only system 
integration) and especially maintenance of integrated data sets. Mediators have 
been suggested to serve as modules between user applications and data resources, 
capturing tasks required to overcome difficulties introduced by huge volumes of 
data, heterogeneities among data resources and mismatch between data values [15].

The integration problem domain involves bringing together data items de- 
scribing some characteristic of a real-world entity emanating from different data 
sources. The system integration problem domain is mostly addressed under the 
title ’interoperability’ [1], which requires agreements on how software systems 
connect themselves to a computer network and allow interaction between compo- 
nents through well-defined interfaces. Interoperability through a common com- 
munication protocol rather than through a common data format [14] constitutes 
a starting point for an integrating mediator design in this paper. 

One aspect of importance when integrating data from autonomous databases 
is how to guarantee maintenance of the integrated data set with respect to 
changes that occur in the component data sets. Since the contents of information 
sources may change frequently, an application dependent on such resources needs 
to be aware of such changes.

The contribution of the work is to identify issues related to the design of
an integrating mediator, deriving from experience gained in a GIS integration
project. This work also demonstrates the use of existing techniques developed in
the information technology community by implementing relationship objects in a
distributed environment. The implementation is based on both the OpenGIS Simple
Feature specification [11] and OMG’s CORBA relationship service specification [9].

The paper is organized as follows. Section 2 describes the organizational 
setting for this work. Section 3 explains what an integrating mediator is sup- 
posed to do. Section 4 describes the implementation of relationships between 
distributed objects and section 5 discusses the meaning and implications of such 
relationships. Section 6 summarizes the topics discussed in this paper. 



2 The Case: Institutional Setting for Data Integration 

A case study was recently undertaken to gain information on factors contributing
to the mismatch between independently established and maintained building
data [4]. The institutional setting for the study consists of 1) local authorities
recording building permit data, 2) the Population Register Centre database
registering different types of data concerning buildings and residences and 3) the
building data stored in the topographic database. The building data of interest 
here consist of data describing the location of buildings, expressed as a reference 
either to a coordinate system or to road networks (addresses). A problem often 
encountered in data integration has been mismatch in data values: applications 
combining map databases with data from the population register often fail to 
perform the task automatically because coordinate values recorded for buildings 
do not fit together (Fig. 1). 

Integration of building data is performed primarily by matching coordinate
values. An alternative basis for the match is the building identifier,
an official identifier assigned to each building requiring a building permit. The
building identifier is independent of any software system implementation.
Its value is a string of characters consisting of various pieces of information that
describe the location of the building with respect to administrative and real
estate areal divisions. Since the topographic database does not contain building
identifiers, the corresponding information can be derived from the cadastral map.

Building data are originally gathered in municipalities during the various
stages of the follow-up process related to construction activity. A new building is
assigned an identifier, expected location data and an address when the building
permit is applied for. The municipality informs the Population Register Centre
of the new building permit application and provides the new building data.

The occasional low quality of the coordinate data obtained from the mu- 
nicipalities is mostly due to the fact that not all municipalities have adequate 
resources and means for gathering the location. It is envisaged here that such 
municipalities could access existing map databases through the Internet, using 
web techniques [5] to obtain reference data for determining the location data 
for the new building. Such reference data sets are already available in Finland 
through the Net [7]. 

3 Integration Problem Domain 

It is envisaged here that compatibility of data values emanating from diffe- 
rent databases could be improved by the use of software that aids in accessing 
data from various databases, establishing interdependencies between data items 
through relationships, and monitoring the validity of such relationships. 

A mediator is a software module that exploits encoded knowledge about some
sets or subsets of data to create information for a higher layer of applications [15].
What exactly an integrating mediator is supposed to do is further explained in
this section. The design of the mediator is based on distributed object technology
concepts, which are also briefly explained.



3.1 Integration and Mediators 

The tasks required for performing data integration comprise: 

— Accessing and retrieving data from multiple heterogeneous resources, using
data export and import services [12] by the target and source information 
communities [10] 

— Abstracting and transforming retrieved data into a common representation 
and semantics, thereby homogenizing any differences between the underlying 
data representations [3] and information content [2] 

— Combining the data items according to matching values. 

Given a set S of local database objects O_Li, the mediator should be able to
create an integrated object R_I that holds information from some elements of S.
The mediator accesses O_Li and imports them to the mediator’s address space,
creates appropriate relationship objects R_I, and monitors the state of each R_I
with respect to changes occurring in O_Li.

O_Li are integrated to form R_I if they fulfill a condition that has been
associated with the definition of R_I. The relationship object R_I is created by the
mediator if the required condition is fulfilled. The tasks required from our
integrating mediator are summarized as:

1. Creation of the relationship object 

— determine the condition to be fulfilled by the component objects (e.g. 
building geometries emanating from different databases must coincide) 

— retrieve local objects 

— determine whether the condition between retrieved objects is fulfilled 

— create relationship object 

2. Monitoring of the relationship object 

— local objects should notify the mediator when a change has occurred in 
them 

— determine whether the condition still holds after change has occurred in 
the component 
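
The two tasks listed above can be sketched as follows. This is an illustration under stated assumptions, not the implementation described in this paper: the classes `LocalObject` and `Mediator`, the callback-based notification, and the tolerance-based `coincide` condition on point locations are all hypothetical.

```python
class LocalObject:
    """Stand-in for a local database object that reports its own changes."""

    def __init__(self, source, location):
        self.source = source
        self.location = location          # (x, y) coordinates
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def move(self, location):
        self.location = location
        for listener in self._listeners:
            listener(self)                # notify the mediator of the change


def coincide(a, b, tolerance=1.0):
    """Example condition: the two locations agree within a tolerance."""
    return (abs(a.location[0] - b.location[0]) <= tolerance
            and abs(a.location[1] - b.location[1]) <= tolerance)


class Mediator:
    def __init__(self, condition):
        self.condition = condition
        self.relationships = []

    def create_relationship(self, a, b):
        """Task 1: create a relationship only if the condition is fulfilled."""
        if not self.condition(a, b):
            return None
        relationship = {"components": (a, b), "valid": True}
        self.relationships.append(relationship)
        a.subscribe(self._on_change)
        b.subscribe(self._on_change)
        return relationship

    def _on_change(self, changed):
        """Task 2: re-check the condition after a component has changed."""
        for relationship in self.relationships:
            components = relationship["components"]
            if any(component is changed for component in components):
                relationship["valid"] = self.condition(*components)
```

A relationship created for two coinciding objects is marked invalid as soon as one of them moves outside the tolerance, and valid again if it moves back.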

3.2 Distributed Object Technology Supporting Integration 

A mediator of the type discussed above can be based on existing software ar- 
chitectures designed for integrating heterogeneous, autonomous and distributed 
computing resources. Middleware technologies facilitate integration of compo- 
nents (objects) in a distributed system by providing a run time infrastructure 
of services for components to interact with each other despite differences in un- 
derlying communications protocols, system architectures, operating systems and 
other application services [13]. 

An example of middleware technology is CORBA [8]. A CORBA object system
is a collection of objects that isolates the requesters of services (clients)
from the providers of services by a well-defined encapsulating interface. Clients
issue requests for services that are implemented on the server and invoked
through an object reference. The CORBA relationship service architecture
[9] provides a mechanism for associating isolated objects and managing
their creation, navigation and destruction. A mechanism for associating
independently implemented and maintained objects is used as a basis for an integrating
mediator.

The CORBA relationship service defines relationships as first-class objects, 
i.e. objects that have a state, and that may have operations. A relationship object 
groups together CORBA objects that are related to each other and defines the 
semantics of the relationship in terms of the degree of the relationship, the types 
of object that are expected to participate in the relationship and cardinality 
constraints. A role represents a CORBA object in a relationship and provides 
mechanisms for navigating the relationship [9].

The relevant portion of the CosRelationships module [9] for purposes of the
present work is given below. The client acquires the roles involved in a relationship
with the aid of the named_roles attribute defined in the relationship interface.
The method get_other_related_object defined in the role interface provides access
to the related object. Another role object involved in a particular relationship is
accessed by the get_other_role method.



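interface Relationship : CosObjectIdentity::IdentifiableObject {
    // Roles taking part in this relationship, each with its name
    readonly attribute NamedRoles named_roles;
};

interface Role {
    // The object that this role represents
    readonly attribute RelatedObject related_object;

    // The object represented by the role named target_name in relationship rel
    RelatedObject get_other_related_object (in RelationshipHandle rel,
                                            in RoleName target_name);

    // The role named target_name in relationship rel
    Role get_other_role (in RelationshipHandle rel,
                         in RoleName target_name);
};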



4 Implementing Relationships between Building Objects 

4.1 An Integrating Mediator 

This section describes a simple implementation on which more functionality 
aiding in the management of integrated objects can be built. An integrating 
mediator is envisaged as an application that accesses distributed building ob- 
ject implementations, performs matching between object values to identify the 
objects that are to be associated with the relationship service and aids in main- 
taining the relationship (Fig. 2). The relationship object integrates object values 
from the component objects and thus provides an integrated view of the distri- 
buted building objects to the client. 

An abstract service domain has been assumed here as the basis for the work.
The mediator invokes services on remote objects (registered in the network and
thus made available to clients) to acquire their values. Services required to implement
such a system include access to the remote object, exporting data from the
remote object and importing data to the mediator’s address space. The integrating
process finds a match between object values and establishes the relationship
service.

The following sections in this paper describe how the relationship service 
is established in this setting. The objects to be associated by the relationship 
service are modelled as OpenGIS simple features [11]. Java classes were imple- 
mented to instantiate the following object types: 

— Factory objects for creating feature types and features [11] 

— Features [11] 

— Role and relationship factories [9]

— Roles and relationships [9]. 



4.2 Related Objects without Relationship Service 

The objects to be associated by the relationship service are digital representa- 
tions of real-world buildings. The representations conform to the following IDL 
interface specifications: 



interface TopoBuilding : OGIS::Feature {};
interface MunBuilding : OGIS::Feature {};

TopoBuilding and MunBuilding object types represent the building data sto- 
red in the topographic database and the building permit database, respectively, 
as was explained in Sect. 2. The OGIS module specification is given in [11]. 

If the features were modelled without inheriting from the OpenGIS Feature,
the feature schema capturing the properties central to the current
integration task would be:

interface TopoBuilding {
    readonly attribute string Buildingid;
    readonly attribute string Use;
    readonly attribute OGIS::Geometry PolygonGeometry;
};

interface MunBuilding {
    readonly attribute string Buildingid;
    readonly attribute string Address;
    readonly attribute OGIS::Geometry PointGeometry;
};



A client that collects data to implement an integrated building object in,
say, a population register application, might have the following local model of a
building:

interface PopBuilding {
    readonly attribute string Buildingid;
    readonly attribute string Address;
    readonly attribute OGIS::Geometry Geometry;
};

The purpose of the client application is to access available objects of the
TopoBuilding and MunBuilding types and find a match either between Buildingid
values or between PointGeometry and PolygonGeometry values. Once a match is
found, the client copies values to the PopBuilding object from the TopoBuilding and
MunBuilding objects. If there were no relationship service, i.e. no integrating
mediator available, the client would make requests for objects implementing the
given interfaces and perform the integration in the local application address
space.
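
A client-side integration of this kind, without a mediator, might look like the sketch below. The dictionary keys follow the attribute names of the interfaces above; everything else is an assumption made for illustration: buildings are plain dictionaries, the geometric match is a point-in-polygon test by ray casting, and the polygon is the geometry copied to the integrated record.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):          # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def integrate(topo_buildings, mun_buildings):
    """Build PopBuilding-like records in the client's own address space."""
    integrated = []
    for mun in mun_buildings:
        for topo in topo_buildings:
            same_id = (mun["Buildingid"] is not None
                       and mun["Buildingid"] == topo["Buildingid"])
            inside = point_in_polygon(mun["PointGeometry"],
                                      topo["PolygonGeometry"])
            if same_id or inside:
                integrated.append({
                    "Buildingid": mun["Buildingid"],
                    "Address": mun["Address"],
                    # Assumption: the polygon is kept as the geometry.
                    "Geometry": topo["PolygonGeometry"],
                })
                break                      # first match wins
    return integrated
```

Every client that needs the integrated view would have to repeat this access and matching work, which is the burden the relationship service is meant to remove.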

4.3 Associating Building Object Implementations Using the 
CORBA Relationship Service 

CORBA Relationship Service Architecture models relationships as first-class ob- 
jects (with identity), whose components are named roles. Roles represent related 
objects, in this case, TopoBuilding and MunBuilding types. Relationship objects 
are designed and maintained by the integrating mediator for the purposes of 
providing the client with an integrated view of the isolated building objects. 
The client accesses the integrated relationship object and is not responsible
for performing the access and matching of values of the component objects itself.
Separate component object implementations hold implementation-dependent
identities and have a property value that can be used for matching the objects
(Buildingid attribute).

The component objects, the TopoBuilding and MunBuilding objects, are related
by the mediator to form a relationship object of the IntegratedBuilding type given
below. The object simply associates TopoBuilding and MunBuilding objects.
Roles corresponding to these objects are named TopoRole and MunRole.

interface TopoBuilding : OGIS::Feature {};
interface MunBuilding : OGIS::Feature {};

interface IntegratedBuilding : CosRelationships::Relationship {};

Table 1 lists the types of object that are involved in the resulting relationship 
object. The relationship semantics is given by cardinality constraints saying that 
no role object may exist without belonging to a relationship object; on the other 
hand, the role object is associated with exactly one relationship object. 
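
The role and cardinality semantics described above can be mimicked in a few lines. This is a plain in-process sketch, not a CORBA binding: the Python classes are hypothetical, and only the role names TopoRole and MunRole and the operation names are taken from the text. The cardinality constraint is approximated by letting the relationship create its own roles, so that a role never exists outside exactly one relationship.

```python
class Role:
    """Represents one related object inside exactly one relationship."""

    def __init__(self, name, related_object, relationship):
        self.name = name
        self.related_object = related_object
        self._relationship = relationship

    def get_other_role(self, rel, target_name):
        if rel is not self._relationship:
            raise ValueError("role does not take part in this relationship")
        return rel.named_roles[target_name]

    def get_other_related_object(self, rel, target_name):
        return self.get_other_role(rel, target_name).related_object


class IntegratedBuilding:
    """Relationship object associating one TopoBuilding and one MunBuilding."""

    def __init__(self, topo_building, mun_building):
        # Roles are created here and nowhere else: cardinality (1,1).
        self.named_roles = {
            "TopoRole": Role("TopoRole", topo_building, self),
            "MunRole": Role("MunRole", mun_building, self),
        }
```

A client holding the TopoRole of a relationship `rel` would reach the municipal building with `rel.named_roles["TopoRole"].get_other_related_object(rel, "MunRole")`.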

Table 1. Roles Participating in the Integrated Building Relationship

Related object   Role type   Cardinality (min, max)
TopoBuilding     TopoRole    (1,1)
MunBuilding      MunRole     (1,1)

4.4 Registering Object Implementations with the ORB and Naming
of the Objects

The central component in OMG’s CORBA is the ORB (Object Request Broker),
which makes it possible for objects to communicate with each other with
no prior knowledge about the location or implementation details of the other object.
The objects that are registered with the ORB can be accessed by resolving
the names bound with the objects. Names are structures consisting of an identifier
and a ’kind’ describing the type of object. The client needs to know about
the semantics associated with each name [9]. The Naming Service provides the
principal mechanism through which most clients of an ORB-based system locate
objects that they intend to use. When given an initial naming context, the client
can navigate through the naming contexts and retrieve lists of the names bound
to that context.

A simple naming scheme was used here: the relationship object name identifier
contains the Buildingid value for clients to identify the correct building.
Designing an appropriate naming scheme for the environment [6] is a key question
when designing an integrating mediator.
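
The naming scheme can be pictured with a toy naming context. The class below is a hypothetical dictionary-based stand-in and not the CORBA Naming Service API; the building identifier value and the kind string used in the usage note are invented for illustration.

```python
class NamingContext:
    """Toy naming context: binds (identifier, kind) names to objects."""

    def __init__(self):
        self._bindings = {}

    def bind(self, identifier, kind, obj):
        name = (identifier, kind)
        if name in self._bindings:
            raise KeyError(f"name already bound: {name}")
        self._bindings[name] = obj

    def resolve(self, identifier, kind):
        """Return the object bound to the name, or raise KeyError."""
        return self._bindings[(identifier, kind)]

    def list_names(self):
        """Return all names bound in this context, sorted."""
        return sorted(self._bindings)
```

The mediator would bind each relationship object under its building identifier, for example `context.bind("B-1", "IntegratedBuilding", relationship)`, and a client that knows the identifier would call `context.resolve("B-1", "IntegratedBuilding")`. The sketch also shows why the scheme is fragile: if the identifier changes, the name under which clients look for the object changes with it.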

5 Relationships: Meaning and Implications 

The purpose of the integrated relationship object (IntegratedBuilding) is to associate
existing building objects for prospective clients to access their values in an
integrated fashion. A client is assumed to be interested in viewing and copying
values of the integrated object. The mediator offers services that give the
client the information on the object that it needs in order to use it.

5.1 Client Scenarios 

The client knows of relationships (integrated objects) through the names that 
have been bound with the naming context in the ORB and their semantics 
through a catalogue service associated with the mediator. The primary purpose 
of the catalogue service in this context is to support identification of registered 
relationship objects. 

A client may simply view the implemented relationship (and its values) or
copy the values to the client’s local data store. For a client that needs
an object whose state agrees with real-world entities, the validity of the
object is relevant. When the client accesses
the relationship object through the mediator, it is always up to date (to the
mediator’s knowledge). If the client stores the relationship locally, the object
no longer remains under the control of the mediator, and state changes in the
relationship are not mirrored in the client database.

The client navigates through the relationship object with the aid of roles,
which can be accessed through the relationship object’s named_roles attribute. Those
clients having an object reference to an object may wish to know whether that
object is related to another object. This can be accomplished by determining all
the registered relationship objects and the roles participating in them, and comparing
the object references of the objects represented by the roles.
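
The lookup described in the last sentence amounts to a scan over the registered relationships. In the sketch below, which is an illustration with hypothetical names, each registered relationship is reduced to a mapping from role name to related object, and object references are compared by identity rather than by value.

```python
def relationships_of(obj, registered_relationships):
    """Return (relationship name, role name) pairs whose role represents obj.

    registered_relationships maps a relationship name to a dictionary
    of role name -> related object.
    """
    hits = []
    for rel_name, named_roles in registered_relationships.items():
        for role_name, related_object in named_roles.items():
            if related_object is obj:     # compare references, not values
                hits.append((rel_name, role_name))
    return hits
```

An empty result means that the object does not take part in any registered relationship.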

5.2 Modelling and Management of Relationships 

Once a relationship object has been created, it is possible that the related component
objects change in a way that makes the association no longer valid. Even
if the association does not lose its validity, it may be of importance
to inform the client about the change. To design the activities of the mediator
service, a model of the relationship life cycle is required on which to base
the management of relationships. Based on the object life cycle model, event and triggering
mechanisms monitoring object state changes can be designed.

Data management consists of the activities of defining, creating, storing, 
maintaining and providing access to data and associated processes in one or more 
information systems [12]. We may assume the existence of a data management 
environment in terms of relationship controller services responsible for 

— Adding and modifying relationship definitions 

— Adding, modifying and deleting relationship instances. 

In the case examined here, possible changes in real-world buildings that must 
be accounted for include: 

— Change in the attribute value that was used in matching the component
objects

— Change in existence of the component object. 

For example, the implementation described in Sect. 4.3 used the building 
identifier value to match component objects. The value is subject to change 
when changes occur in real estate division (e.g. two real estates are merged). The 
implementation also used the building identifier as a basis for creating names 
of the objects, the knowledge of which the client is assumed to rely on when 
accessing them. Another type of change that can occur is one in which the real- 
world building ceases to exist, thereby making existence of the whole relationship 
object meaningless. 



6 Summary and Conclusion 

We have reviewed issues related to integration of data emanating from various
sources and outlined functionality of an integrating mediator that integrates
data and manages the integrated data set. An experiment has been carried out in
which the CORBA relationship service and the OpenGIS simple feature specification
were used as the basis for designing the mediator.

This work has been motivated by experience gained when carrying out an 
integration task in which building data were integrated. Without the mediator 
service, users have to invoke various applications when integrating data sets. 
This is not considered to be appropriate. It has been demonstrated here that 
distributed object technology offers techniques that can be used to build appli- 
cations that provide the end user with a more integrated view of autonomous 
data resources. 

This paper has demonstrated the use of the CORBA relationship service in 
forming associations between distributed objects. The next step in managing 
the implemented relationship objects would be to use event services to monitor 
and invoke further actions when encountering state changes in related objects. 
One key problem area that was identified is the naming scheme that needs to 
be established to identify objects between organizations, a problem area that is 
currently being discussed in the OpenGIS community.


Abstract. In order to improve the spatial information infrastructure
in Japan, we have organized SFCO (Spatial Information Infrastructure
Interoperability Consortium), and we are developing a new Japanese
interoperable test-bed based on OGIS. For this system, we propose a
new three-tier model composed of web clients, legacy database
wrappers, and the GSM (Geo Spatial Mediator). In particular, the GSM sits between
the clients and the wrappers, and can compensate spatial objects. Moreover,
we propose a container-based fast transfer interface for spatial objects
for the CORBA implementation.



1 Introduction 

As is well known, many NSDI (National Spatial Data Infrastructure) activities
and projects have started around the world. For example, in the United States,
the OGC (OpenGIS Consortium) WWW Mapping SIG is currently promoting an OGIS (Open
Geodata Interoperability Specification) based interoperable test-bed project
[1]. In Europe, GIPSIE (GIS Interoperability Projects Stimulating the
Industry in Europe), newly organized as a joint industry-university
project, has started to support standard interoperable GIS in Europe [2].



1.1 GIS Situation of Japan 

In contrast to these activities elsewhere, the Japanese central government has
traditionally not given enough consideration to the importance of GIS. But 18

* We would like to thank the SI3CO members for developing this system. There are 29
contributors from two universities (Tokyo and Keio), eight companies (Asia, Falcon, 
Hitachi, IBM, Kokusai, NEC, Oki, and Pasco) and NSDIPA. There are too many 
people to list here, but we would like to thank these contributors. 



A. Vckovski, K.E. Brassel, and H.-J. Schek (Eds.): INTEROP’99, LNCS 1580, pp. 265-276, 1999.
© Springer-Verlag Berlin Heidelberg 1999
inter-ministerial liaison committees have recently been started, prompted by
the Hanshin-Awaji earthquake disaster. Moreover, the importance of interoperability
among legacy geospatial databases is not yet acknowledged.



1.2 Project Motivation 

We will therefore construct a new geospatial information infrastructure by using
distributed object technology to connect various information communities. First,
we create a spatial object model that enables interoperable use of geospatial
data maintained by different information communities. Second, we develop
a distributed object environment that supplies information to web
terminals through interoperable use of the legacy geospatial databases maintained
by each information community. Our motivation for these activities is to
construct the middleware for the GIS software environment. This middleware
should provide the following functions.

- Effective sharing and circulation of geospatial information based on the
simple feature specification.

- A quick and high-quality user interface.

- Easy interoperability between legacy databases.

2 The Japanese Interoperable GIS Test-Bed 

With the aim of providing the above-mentioned functions, we organized SI3CO
(Spatial Information Infrastructure Interoperability Consortium) with two
universities, eight companies, and NSDIPA (National Spatial Data Infrastructure
Promoting Association in Japan). We are developing a Japanese
interoperable GIS test-bed based on OGIS. This project is promoted as a national
project and is financed by the quasi-governmental organization IPA (Information
Technology Promotion Agency).



2.1 Purpose of System Development 

The application fields for geospatial information in Japan fall into two
categories: consumer use and professional use. Consumer applications include the
road-route guidance systems known as car navigation. These systems usually use
medium-precision road maps at scales of 1/25,000 to 1/10,000. The
road maps are stored on CD-ROM and are accessed from a stand-alone
navigation system.

On the other hand, professional applications include municipal administration
systems such as disaster protection systems and environmental preservation
systems. Other professional applications are the facility management systems of
utility companies. These applications usually use high-precision residential
maps at scales of 1/500 to 1/2,500.
This test-bed project handles professional use of a geospatial information
infrastructure shared among various divisions. The purpose of constructing the
test-bed is to confirm whether it is possible to build a practical system
from high-precision geospatial data and a system based on the OpenGIS
simple feature specification.

2.2 Problems in System Development 

The FGDC (Federal Geographic Data Committee) demonstration system, which
is now being developed by the WWW Mapping SIG, adopts a two-tier model. In
this system, legacy databases managed by different GIS vendors such as Bentley,
Intergraph, ESRI, and ORACLE are integrated by three kinds of DCP (Distributed
Computing Platform): CORBA, OLE/DB, and SQL. The system
enables overlay display on web terminals via the Open Map IDL [3] [4] [5]. The data
handled by this system consists of relatively coarse-scale geospatial data
such as municipal boundaries, road routes, TIN, and coverage [6].

In contrast, domestic Japanese applications require not only high-precision
geospatial data but also utility data such as electric power lines and water supply
and sewage lines. In such cases, the following problems arise.

1. Geographic data and facility data are usually supplied from different
legacy databases, and two kinds of object difference can occur even in simple
overlay processing. One is physical difference, such as coordinate rotation or
aberration; the other is semantic difference, such as when objects that
carry the same name are in fact different.

2. The construction cost of high-precision geospatial data is very high, so it
is impractical to maintain such data over a wide area. A compensation technique is
therefore required in which areas lacking high-precision data are filled in
seamlessly with coarse-precision data.

3. The common clearing house function mainly supports searching for the network
addresses of web sites whose properties match user-selected properties.
But when spatial objects contain physical or
semantic differences, such clearing house functions are not sufficient. An
autonomous trading function must therefore search for more suitable web sites if the
obtained and evaluated spatial object contains large differences.



2.3 System Organization Concepts 

In order to solve the above-mentioned problems effectively, we propose a new
three-tier model composed of web clients, legacy database wrappers, and a GSM
(Geo Spatial Mediator). The GSM is located between the clients and the wrappers, and
it compensates spatial objects. The block diagram of our proposed three-tier
interoperable system is shown in Fig. 1. The bottom level is the wrapper part, which
converts the retrieval results from legacy databases into spatial objects based on
the OpenGIS simple feature spec.; the middle level is the GSM, which comprises three
kinds of spatial object processing functions: composer, compensator, and






S. Shimada and H. Fukui 



trader; and the highest level is the web clients which support high-speed display 
of spatial objects. 



Portrayal Model Correspondence If we compare our proposed architecture
with the portrayal model adopted by the WWW Mapping SIG, the result is
as follows. The portrayal model clusters the pipeline process into two groups:
the process from legacy database retrieval to feature extraction, treated as one
component, and the process from feature transportation to graphical
display, treated as three detailed components. Our proposed three-tier
model also clusters the pipeline process into two groups, but treats the former
process as two components and the latter process as one component. The latter
display process is regarded as a deep client system. Depending on how the GSM is
implemented, it will be possible to support any level of interface from the thin
client level to the deep client level according to user demands. This user-adaptable
function is supported by the GSM’s active retrieval mechanism, whose detailed
process flow is shown in Fig. 2. It is also possible to treat the filter component
of the portrayal model as a more precise process when the GSM is
inserted.



2.4 GSM (Geo-Spatial Mediator) 

The characteristic functions of our proposed system are provided by the GSM at the
middle level. Generally speaking, the role of a mediator in a heterogeneous database
retrieval environment has already been formulated by various studies. For example, the
TSIMMIS project has offered a data model and a common query language that
are designed to support the combining of information from many different sources
[7] [8]. But this formulation is not concrete enough for geospatial information
processing.

To cope with this situation, our proposed GSM architecture supports the spatial
object processing functions specified by OpenGIS Spec. Topic 12: "The
OpenGIS Service Architecture", except for image processing, in order to solve the
first and second problems above. For example, the GSM partially implements the
"Geospatial Coordinate Transformation Services" and the "Geospatial Feature
Manipulation Services". The GSM also supports trader functions, which search web
sites based on meta-data, in order to solve the third problem. These functions
are organized into three categories.

Control of distributed retrieval If the GSM receives a retrieval demand from web
terminals, it orders the trader to search for the web sites that hold the most suitable
spatial objects. The GSM then directs the compensator to check the differences
between the composed spatial objects, and it carries out the actual processing of the
obtained spatial object archives, such as retrieval, composition, and conversion.
The GSM has the following two functions.

Spatial object trading Search for the location of the legacy databases holding the
most suitable spatial object according to the trader graph which the server holds.

Spatial object compensation If the application object is simply composed of plural
spatial objects, physical differences caused by the difference between coordinate
systems, as well as semantic differences caused by the same objects having
different names, still exist. So the GSM is equipped with compensation functions to
regulate these object differences.

Fig. 1. Proposed three-tier model for Japanese test-bed based on OGIS

MapObjects is a trademarked object-oriented GIS environment produced by ESRI.
SDE is the acronym of the spatial data engine produced by ESRI.

GenaMap is a trademarked GIS produced by GenasysII Corp.

GeoProvider is an interoperable GIS tool produced by Hitachi Software
Engineering Co., Ltd.

ObjectSpinner is a trademarked ODBMS produced by NEC Corporation.

VisiBroker is a trademarked CORBA product of Inprise Corporation.

TPBroker is a trademarked CORBA product of Hitachi, Ltd., which adds
object-transaction management to VisiBroker.

Among these GSM functions, we focus on the characteristic functions, trader 
and compensator, as explained below. 



Geospatial Object Trading Function We developed meta-data oriented
search functions similar to common clearing house functions. These functions
can determine where a spatial object is and what kind of object it is. In
this case, the search function must consider the special relationships among
geospatial information. These special relationships result from the scale hierarchy
or from differences of media style, such as vector or image. It is therefore necessary
to develop a hierarchical search method based on a trader graph. We
therefore developed a circular information infrastructure. This infrastructure is
composed of network-connected servers with a GSM at the center, and each such
group builds an individual community. The trading function is the core technology for
autonomous search between GSMs.
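
The hierarchical search over a trader graph can be pictured with a small sketch. The following Python fragment is an illustration only and is not taken from the test-bed: the graph layout, the node names, the metadata fields (`scale`, `media`) and the function `trade` are all assumptions introduced here. It walks the graph breadth-first from the local GSM and picks the data set of the requested media style whose scale is closest to the requested one.

```python
from collections import deque

# Hypothetical trader graph: each GSM node lists its neighbours and the
# metadata of the data sets it can reach (scale denominator, media style).
TRADER_GRAPH = {
    "gsm_city": {"links": ["gsm_pref"], "datasets": [
        {"name": "residential", "scale": 1500, "media": "vector"}]},
    "gsm_pref": {"links": ["gsm_city", "gsm_national"], "datasets": [
        {"name": "summary", "scale": 25000, "media": "vector"}]},
    "gsm_national": {"links": ["gsm_pref"], "datasets": [
        {"name": "aerial", "scale": 7500, "media": "image"}]},
}


def trade(graph, start, wanted_scale, media):
    """Breadth-first search from `start`; return (node, dataset name) for the
    data set of the requested media whose scale is closest to `wanted_scale`,
    or None if no data set of that media style is reachable."""
    best = None
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for ds in graph[node]["datasets"]:
            if ds["media"] != media:
                continue
            gap = abs(ds["scale"] - wanted_scale)
            if best is None or gap < best[0]:
                best = (gap, node, ds["name"])
        for nxt in graph[node]["links"]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None if best is None else (best[1], best[2])
```

A real trader would also weigh the physical and semantic differences reported by the compensator; the sketch only covers the scale hierarchy and the media style named in the text.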

Geospatial Object Compensation In our test-bed system, we try to access
high-precision geospatial data, so we expect that spatial objects will
contain various differences. With interoperable composition methods, these
spatial objects are supplied directly from legacy databases, where they are
optimized for ordinary usage in each community. Real-time compensation
mechanisms for spatial objects are therefore provided as one service menu of the GSM.
These services are split into physical and semantic compensation as follows.
Physical compensation 

As already mentioned, there are physical differences between spatial objects
supplied from legacy databases. These differences are caused by coordinate rotations
or shifts. Compensation thus consists of correcting processes for two kinds
of errors. One is the transformation error between geodetic coordinates and
orthogonal coordinates. The other is the shift error among coordinate systems
with different scales. The latter is mainly caused by offsets of the origin
coordinates, which belong to the 19 Japanese standard coordinate systems.

The positional distortion can be compensated by adding a transformation process
to standard coordinate systems or standard geometrical models. This compensation
process is specified in OGC spec. Topic 12: "Geospatial Coordinate Transformation
Services".
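
As a rough illustration of the shift and rotation part of physical compensation, the following Python sketch applies a plane rotation followed by an origin offset to a list of points. It is a simplified example under stated assumptions: the function name `compensate` and its parameters are hypothetical, and the geodetic-to-orthogonal projection, which needs a proper map projection, is deliberately left out.

```python
import math


def compensate(points, origin_offset, rotation_deg):
    """Rotate each (x, y) point about the origin by `rotation_deg` degrees
    (counter-clockwise), then shift it by `origin_offset` = (dx, dy)."""
    theta = math.radians(rotation_deg)
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = origin_offset
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]
```

For instance, a data set whose origin differs from that of the target coordinate system would be passed through `compensate` with the known origin offset before being overlaid on other layers.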

Semantic compensation 

In order to compose spatial objects while keeping semantic consistency, semantic
conversion based on the definitions of the terminology and schema structure used
in each community is needed. In our test-bed system, semantic conversion of
structure means that feature names, properties, and structures are reconciled between
spatial databases, as shown in Fig. 3. This figure shows the feature
correspondence between the residential map, the summarized map, and the disaster



SI3CO Test-Bed






prevention. We thus implemented compensation processing based on services of the OGC
service architecture spec., such as the "Feature Manipulation Service", the "Feature
Generalization Service", and the "Feature Analysis Service".

This semantic compensation is expected to become more complicated, but it will also
become more domain specific.



3 Implementation Status

Our test-bed system is now being implemented with CORBA based on the three-tier model.
In this implementation, we use VisiBroker, a CORBA 2.0 based
ORB product supplied by Inprise Corp. Many ORB products have already
been announced, such as IONA’s Orbix, but we adopted
VisiBroker for our implementation because it was adopted for the
IDL-Java mapping in OMG and, moreover, as the standard CORBA interface
by Netscape Corp.

The greatest improvement in this implementation is the proposed new method
for transferring features. The OpenGIS Simple Feature Specification for CORBA
specifies a method that transfers features one feature at a time, but transfer
speeds are very slow when transferring heavy geospatial data loads. We therefore
propose the "ContainerFeatureCollectionSet Interface", which enables extremely fast
transfer of a set of features in one IIOP communication.



3.1 ContainerFeatureCollectionSet Interface 

First, we specify ContainerFeatureCollectionSet in order to express a set of
features. The top-level content of the internet geospatial-data format (IGF¹) is a
collection of plural common layers. That is, features of the same kind are collected
into a layer, and layers are collected into a container unit. Second, we explain
the two newly supported operations of the ContainerFeatureCollectionSet
interface as follows.

- Spatial retrieval operation
(get_ContainerFeatureCollectionSet_by_geometry): Describe the retrieval
area as the retrieval condition and get the set of features contained in
the specified area.

- Property retrieval operation
(get_ContainerFeatureCollectionSet_by_property): Describe the retrieval
condition for the attribute information and get the set of features whose
properties match the retrieval condition.

The following IDL program is a concrete description of the above-mentioned
ContainerFeatureCollectionSet.

¹ IGF is the acronym of the internet geospatial-data format defined by Hitachi Software
Engineering Co., Ltd.








[Figure: pipeline stages Image, Display Elements, Features, and Data Source;
components WWW Server and GSM]

Fig. 2. Process flow of the GSM’s active retrieval mechanism. When the terminal
capability is sent to the GSM, it is evaluated and the spatial retrieval functions are
dispatched. The retrieved data is then actively converted into various forms by the
media conversion function.
Fig. 3. Diagram of feature correspondence among DB communities. Semantic struc-
//ContainerFeatureCollectionSet Interface
module CFCSIF {

  typedef string Istring;                //Extracted portion of OGIS CORBA spec.
  typedef sequence<Istring> IstringSeq;  //
  typedef sequence<octet> OctetSeq;      //Following is SI3CO supplement

  struct RetrievedFeatureType {
    Istring ftype_name;           //Name of Feature Type
    Istring condition_prop_name;  //Name of Condition
    any     prop_value;           //Value of Property
    Istring condition;            //Condition (eq, ne, ge, le, gt, lt, like, etc.)
  };

  typedef sequence<RetrievedFeatureType> RetrievedFeatureTypeSeq;

  //Interface Definition
  interface ContainerFeatureCollectionSet {

    //Spatial Retrieval Operation
    void get_ContainerFeatureCollectionSet_by_geometry (
      in Istring coordinate_system,  //Kind of Coordinates "GEOGCS" or "PROJCS"
      in Istring coordinate_name,    //Name of Coordinates "BESSEL" or "JA19-9"
      in short unit,                 //Orthogonal: Unit (1x10^n cm), Geodetic: null
      in double x,                   //Center of X Coordinate
      in double y,                   //Center of Y Coordinate
      in double w,                   //Width
      in double h,                   //Height
      in long scale,                 //Display Resolution
      in RetrievedFeatureTypeSeq ftype_name_list,  //Name of FeatureType + Condition List
      out OctetSeq geodata           //IGF Typed Geospatial Data
    );

    //Property Retrieval Operation
    void get_ContainerFeatureCollectionSet_by_property (
      in RetrievedFeatureType ftype,  //Name of FeatureType + Retrieval Condition
      in Istring request_prop_name,   //Name of demanded properties
      out OctetSeq geodata            //IGF Typed Geospatial Data
    );
  };
};
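
To make the motivation for the container interface concrete, the following Python sketch contrasts per-feature transfer with transfer of one packed octet sequence. It is only a model under stated assumptions: the byte layout (a 4-byte length prefix followed by JSON) is invented for this illustration and is not the IGF format, and `CountingChannel` is a hypothetical stand-in for a remote IIOP call that merely counts round trips.

```python
import json
import struct


def pack_container(layers):
    """Serialize {layer_name: [feature_dict, ...]} into one byte string.
    Assumed layout (not IGF): 4-byte big-endian length + UTF-8 JSON body."""
    body = json.dumps(layers, sort_keys=True).encode("utf-8")
    return struct.pack(">I", len(body)) + body


def unpack_container(payload):
    """Inverse of pack_container."""
    (length,) = struct.unpack(">I", payload[:4])
    return json.loads(payload[4:4 + length].decode("utf-8"))


class CountingChannel:
    """Stand-in for a remote call; it only counts round trips."""

    def __init__(self):
        self.round_trips = 0

    def send(self, payload):
        self.round_trips += 1
        return payload


def transfer_per_feature(channel, layers):
    """One remote call per feature, as in feature-unit transfer."""
    out = {}
    for name, features in layers.items():
        out[name] = [
            json.loads(channel.send(json.dumps(f).encode("utf-8")).decode("utf-8"))
            for f in features
        ]
    return out


def transfer_as_container(channel, layers):
    """One remote call for the whole set of layers."""
    return unpack_container(channel.send(pack_container(layers)))
```

With three features in two layers, the per-feature path needs three round trips while the container path needs one; the saving grows with the number of features, which is the effect the interface is designed to exploit.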



3.2 Web Client Implementation 

To widen the market of business applications without compromising the ability of 
application programs, we construct the common portion of various applications 
as the plug-ins and construct their different portions as the application applets. 

Application applets. For example, we suppose that these applications are
used in a disaster prevention system in a municipal government. We
implement a "Refuge planning system", which needs medium-scale
wide-area maps, and a "Dangerous place management system", which needs
narrow-area but precise-scale maps, as Java applets.

Plug-ins. We developed the following functions, which enable high-speed
display and scrolling on web terminals. These operations are effective even
for high-precision maps:

Scroll function: Scroll quantity management, pixel thinning during fast scrolling,
estimation of scroll direction, and interlocked scrolling of two windows.

Zoom in/out: Automatic selection of graphic information fit for display
capabilities.

Layer on/off control: Control of layers according to applet demand,
and on/off control of display according to the display limit under the selected
scale conditions.

Split control of display: Display management that splits the display into two
windows, displays two kinds of maps of the same area, and merges them
into one window.



3.3 GSM Implementation 

The physical compensation mechanism of the GSM is implemented as uniform geometrical
processing consisting of coordinate transformation based on the properties
of the legacy databases. Semantic compensation, on the other hand, should be
implemented based on the OpenGIS service architecture, but the details of this
specification are not settled yet. In this case, we therefore implement functions
that conceptually adjust geo-spatial objects in the disaster prevention
application area: the "Feature Type Translation Function", which
translates the names and structures of geo-spatial objects, and the "Composition
Function", which relates and composes objects that are semantically the same as
each other.



Feature-type translation function. This function absorbs schema differences
between legacy databases by interoperably transforming geo-spatial object names
and structures between the databases. The OpenGIS service architecture presents
the concept of object translation, which translates and transforms object schemas
between communities. In the real world, however, object relationships are not
simple one-to-one relations but M-to-N relations composed of "is-a"
or "part-of" relationships. In order to transform these M-to-N relationships,
we implement the feature type name translation function and the data structure
conversion function.

In the name translation function, object names are managed by the geo-spatial
object relationship table, from which the name used in each database can be
obtained and any object name can be easily accessed. In the data structure
conversion function, the name of the object is
also looked up in the geo-spatial object relationship table, and the structure of
the accessed objects is converted into the corresponding structure according to
the feature type management table.
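
The two table lookups described above can be sketched as follows. This Python fragment is a hedged illustration: the table contents, the community and feature names, and the functions `translate_names` and `convert_structure` are hypothetical and do not reproduce the actual test-bed tables. It shows an M-to-N relationship table with "is-a" and "part-of" entries and a per-target-type property mapping.

```python
# Hypothetical geo-spatial object relationship table:
# (community, feature type) -> list of (target community, target type, relation).
RELATIONSHIPS = {
    ("residential", "house"): [
        ("disaster", "shelter_candidate", "is-a"),
        ("summary", "built_up_area", "part-of"),
    ],
    ("residential", "school"): [
        ("disaster", "shelter_candidate", "is-a"),
    ],
}

# Hypothetical feature type management table: property renames per target
# type; None means the property does not exist in the target schema.
FEATURE_TYPES = {
    ("disaster", "shelter_candidate"): {"householder": "contact", "floors": "storeys"},
    ("summary", "built_up_area"): {"householder": None, "floors": None},
}


def translate_names(community, ftype, target):
    """Return every (target type, relation) that `ftype` maps to in `target`."""
    return [
        (t, rel)
        for c, t, rel in RELATIONSHIPS.get((community, ftype), [])
        if c == target
    ]


def convert_structure(props, target, target_type):
    """Rename or drop properties according to the feature type table."""
    mapping = FEATURE_TYPES[(target, target_type)]
    out = {}
    for key, value in props.items():
        new_key = mapping.get(key, key)  # unknown properties keep their name
        if new_key is not None:          # None: dropped in the target schema
            out[new_key] = value
    return out
```

Because several source types can map to one target type and one source type to several targets, both lookups return collections rather than single values.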

Composition function. Geo-spatial objects acquired from plural
legacy databases are not supplied directly to the applications; the composed
results are supplied instead. They are obtained by the following processing. First,
relationships are resolved from synonyms between the geometrical data and attribute
data that are held by the geo-spatial objects. Second, new geo-spatial
objects are composed from these relationships and supplied to the application.
Judgement of synonyms between objects is based on the following operations,
whose targets are the geometrical and attribute data held by
geo-spatial objects.

- Synonym judgement based on a geometrical similarity search.

- Synonym judgement based on partial text string matching.
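
The two synonym judgements can be sketched in Python as below. The measures used are assumptions made for this illustration, since the paper does not specify them: geometrical similarity is approximated by the overlap ratio of bounding boxes, partial string matching by a case-insensitive substring test, and the two results are combined with a logical "or". The threshold `min_overlap` is likewise hypothetical.

```python
def bbox_overlap_ratio(a, b):
    """a and b are (xmin, ymin, xmax, ymax); returns intersection / union area."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def partial_match(name_a, name_b):
    """True if one non-empty name contains the other, ignoring case."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    return bool(a) and bool(b) and (a in b or b in a)


def is_synonym(obj_a, obj_b, min_overlap=0.5):
    """Objects are dicts with 'name' and 'bbox'; either judgement suffices."""
    return (
        bbox_overlap_ratio(obj_a["bbox"], obj_b["bbox"]) >= min_overlap
        or partial_match(obj_a["name"], obj_b["name"])
    )
```

A stricter policy would require both judgements to agree; which combination suits a given pair of databases is a domain decision.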



3.4 DB Wrapper Implementation 

We implemented four database wrappers, each developed by a different
vendor. The properties of the geo-spatial databases and the native GIS supported
by each vendor are summarized in the following table.



Table 1. DB Wrapper and Assigned Vendors 



DB-Name           Scale      Contents                                              GIS & DBMS                 Assigned Vendor
----------------  ---------  ----------------------------------------------------  -------------------------  -------------------
Residential Map   1/1,500    Digital map related with householder name             Original DBMS              Oki Electric Corp.
Summarized Map    1/25,000   Outline map composed of boundaries, roads and so on   MapObjects + SDE, ORACLE   PASCO Corp.
Aerial Photo      1/7,500    Orthographic aerial photo image                       ORACLE                     Kokusai-Kogyo Corp.
Application Map   any        Various subjects focused on disaster prevention       ObjectSpinner (ODBMS)      NEC Corp.



The results retrieved from the legacy databases are wrapped into
simple feature level geo-spatial objects, and they are transferred to the upper level,
the GSM, via the IIOP protocol through the previously mentioned
ContainerFeatureCollectionSet interface.



4 Conclusions 

We have developed the SI3CO architecture as the Japanese interoperable GIS
test-bed system based on the OpenGIS simple feature specification, and we
proposed a new three-tier model composed of web clients, legacy database
wrappers, and a GSM. In the GSM, we try to accomplish trading and semantic
compensation in geospatial infrastructures. These architectures are now
being implemented with CORBA, and we have successfully implemented a very
fast transfer protocol for spatial objects named the "ContainerFeatureCollectionSet
Interface".

In the future, we will try to realize an object transaction mechanism for distributed
databases and agent-based asynchronous object concurrency control by adopting
the function of a geospatial object repository in the GSM.
Abstract. In order to make database systems interoperate with systems
beyond traditional application areas a new paradigm called “exporting 
database functionality” as a radical departure from traditional thinking 
has been proposed in research and development. Traditionally, all data 
are loaded into and owned by the database, whereas according to the new 
paradigm data may reside outside the database in external repositories 
or archives. Nevertheless, database functionality, such as query proces- 
sing and indexing, is provided exploiting interoperability of the DBMS 
with the external repositories. Obviously, there is an overhead involved 
having the DBMS interoperate with external repositories instead of a 
priori loading all data into the DBMS. In this paper we discuss alterna- 
tives for interoperability at different levels of abstraction, and we report 
on evaluations performed using the Concert prototype system making 
these cost factors explicit. 



1 Introduction 

Today’s Database Management Systems (DBMS) make the implicit assumption
that their services are provided only to data stored inside the database. All data
has to be imported into and be “owned” by the DBMS in a format determined
by the DBMS. Traditional database applications such as banking usually meet
this assumption. These applications are well supported by the DBMS data model,
its query and data manipulation language and its transaction management.
Advanced applications such as GIS, CAD, PPC, or document management systems,
however, differ in many respects from traditional database applications.
Individual operations in these applications are much more complex and not easily
expressible in existing query languages. Powerful specialized systems, tools
and algorithms exist for a large variety of tasks in every field of advanced
applications, requiring these systems to interoperate and make their data available to
other systems.

Because of the increasing importance of advanced applications, DBMS de- 
velopers have implemented better support in their systems for a broader range 
of applications. Binary Large Objects provide a kind of low-level access to data 
and allow individual data objects to become almost unlimited in size. Instead
of storing large data objects in BLOBs, some newer systems such as ORACLE
(Version 8) and Informix (Dynamic Server with Universal Data Option) provide
the BLOB interface also to regular operating system files. Because the large
objects in either of these two options are uninterpreted, database functionality for
this kind of data is very limited. In order to better support advanced
applications, the standardization effort of SQL3 specifies, among others, new data
types and new type constructors. Most recently, SQL3 and object-orientation 
have fostered the development of generic extensions called datablades [8], car- 
tridges [11], and extenders [7]. They are based on the concept of abstract data 
types and often come with specialized indexing. 

Although these extensions provide better support for advanced applications,
they all, except for the file system case, have the same fundamental deficiencies.
First, it is the DBMS together with its added extensions that prescribes the data
structure and data format of the data to be managed. The consequence is that all
complex specialized application systems and tools must be rewritten using the
data structures enforced by the DBMS, or at least complex transformations must
take place to map the DBMS representation into the application representation.
Second, the DBMS owns the data. All data has to be physically stored inside
the DBMS, possibly requiring gigabytes of data to be loaded into the database store.

These observations led to a radical departure from traditional thinking as 
it is expressed in [19]. In the Concert project at ETH, we focus on exporting 
database functionality by making it available to advanced applications instead 
of requiring the applications to be brought to the DBMS. In [14] and [15], we 
presented the concepts needed to enable the DBMS to interoperate with external 
data repositories exporting its functionality to data stored outside the DBMS. 
Query processing and indexing is performed by generic methods of physical da- 
tabase design invoking operations of user-defined abstract (external) data types. 
In this paper, we identify different levels of abstraction, at which interoperability 
for query processing can take place, and we present performance measurements 
identifying the costs required for the additional flexibility. With the exception 
of [18] we are not aware of other work that deals with external data and related 
performance measurements. 

This paper is organized as follows: Section 2 introduces the two possible 
levels of abstraction, at which interoperability of the Concert query engine 
with external repositories can take place. Section 3 discusses the lower level of 
abstraction exploiting the integration of abstract data objects in the Database 
Kernel. In Section 4 we present Concert’s object manager “Harmony”, which 
adds higher level query processing capabilities. Section 5 concludes. 



2 The Concert Architecture 

In Concert, interoperability can take place at two different levels of abstrac- 
tion corresponding to the Concert system architecture. Concert consists of a 
database kernel system and a generic object manager. Figure 1 gives an overview 
of the Concert architecture. Interoperability at the kernel level is performed 
on an object by object basis while interoperability at the object manager level
is based on accessing collections of objects. 



[Figure: layers labelled Object Methods; Query Processing over Global Schema;
Generic Methods of Physical Design; Interoperation with External Systems;
Concert Kernel System; Operating System]

Fig. 1. Overview of the Concert system architecture



The Concert kernel system provides low-level database functionality such
as storage management, low-level transaction management, and predicate and
projection evaluation on single collections as basic query processing capability.
Its role is comparable to System R’s RSS [1], the Starburst Kernel [9] or the
DASDBS Kernel [16]. In order to make data management efficient, the kernel
is tightly bound to the underlying operating system, providing multithreading
and exploiting efficient secondary storage access using a memory-mapped buffer
[2]. With respect to interoperability, the important point is the kernel’s capability
to export physical database design, making it available to external systems. The kernel
implements generic methods of physical design such as a Btree index, an inverted
file index, and an Rtree-like spatial index. In contrast to traditional systems, these
indexes are not connected to a predefined type system; rather, they rely on object
properties. These object properties are made available through modules that can
be plugged into the kernel, providing access to external objects through method
invocation. The Concert kernel and its interoperability evaluation are presented
in Section 3.
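
The idea of an index that relies on object properties rather than on a predefined type system can be illustrated with a small Python sketch. It is not Concert code: the class `OrderedIndex`, its methods, and the sorted-list implementation (used here in place of a real B-tree) are assumptions made for this example. The index never inspects the stored objects itself; it only calls a plugged-in function that returns the ordering property.

```python
import bisect


class OrderedIndex:
    """Sorted-list stand-in for a B-tree index. The key function is the
    plugged-in module that exposes the ordering property of an object."""

    def __init__(self, key_of):
        self._key_of = key_of
        self._keys = []
        self._refs = []

    def insert(self, ref, obj):
        """Index the external object `obj` under the reference `ref`."""
        key = self._key_of(obj)
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._refs.insert(pos, ref)

    def range(self, low, high):
        """Return references whose keys lie in the closed interval [low, high]."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return self._refs[lo:hi]
```

The same index class could serve image archives, documents, or CAD parts, provided each supplies its own key function; the objects themselves can stay in the external repository and only references are kept in the index.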

The Harmony object manager sits on top of Concert’s kernel system. At 
this level, Concert abstracts from single data objects. Harmony provides 
declarative access to collections of abstract objects according to the global 
schema. The main concept of this layer is a collection, which represents a 
homogeneous set of data objects. Its elements are accessed through an iterator 
interface. Furthermore, Harmony defines a query algebra over the collections. 
Each collection of the object manager is in fact a queryable collection: it 
offers an evaluate method, through which a declarative query can be executed on 
its elements, resulting in a new collection representing the result set. The 
global data model is object-oriented. The user can define object methods, whose 
object code is plugged into the object manager. Further, Harmony can 
interoperate with external systems through wrappers. These allow the query 
processing functionality of Concert to be exported to external repositories. 
The Harmony object manager and its interoperability evaluation are presented in 
Section 4. 
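
The collection concept can be illustrated with a minimal Python sketch. It is 
not Harmony's actual CORBA interface: the class name QueryableCollection and the 
evaluate method follow the text, whereas passing a Python predicate in place of 
a declarative query, and the sample image records, are assumptions made for 
brevity. 

```python
from typing import Callable, Iterable, Iterator


class QueryableCollection:
    """A homogeneous set of objects that can be scanned and queried."""

    def __init__(self, elements: Iterable[object]):
        self._elements = list(elements)

    def __iter__(self) -> Iterator[object]:
        # Iterator interface: the only way elements are accessed.
        return iter(self._elements)

    def evaluate(self, predicate: Callable[[object], bool]) -> "QueryableCollection":
        # The result is again a queryable collection, so queries compose.
        return QueryableCollection(e for e in self if predicate(e))


# Hypothetical sample data standing in for image objects.
images = QueryableCollection([
    {"title": "Alps", "year": 1998},
    {"title": "Zurich", "year": 1999},
    {"title": "Geneva", "year": 1999},
])

recent = images.evaluate(lambda img: img["year"] == 1999)
titles = [img["title"] for img in recent.evaluate(lambda img: img["title"] > "H")]
```

Because evaluate returns a collection of the same kind, queries can be chained, 
which is the property a query algebra over collections relies on. 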



Example: To illustrate the interplay of the different layers of Concert, we look 
at the following simple example: Imagine a geospatial image archive. A satellite 
periodically generates new images together with descriptive information (e.g., 
current date and time, satellite position, etc.). These satellite images are stored 
in a huge tape archive. Storing the image triggers the creation of a corresponding 
new image object in Concert, which holds the meta information about the 
image and its position in the tape archive - but without the image data itself. 

First, we ask for the titles of all images. This query will be mainly processed 
inside the Concert kernel. The abstract image objects are of the plugged-in type 
SatelliteData (see Appendix), which has the general form RECORD ( SCALAR, 
SPATIAL, SPATIAL ). Harmony scans the image objects, asking for their first 
component, which the global schema declares as title attribute (see Appendix). 
The kernel calls the concept-typical method SUB_OBJECT() on the corresponding 
plugged-in type SatelliteData, and returns the extracted subobject. This is 
in fact a string. Harmony generates the result collection of string values. No 
further processing is needed. 

Second, we want to display a part of an image whose title we know. Harmony 
now scans the image objects, asking for all objects whose first component 
holds the specified string value (i.e., is the search title). This can again be 
evaluated inside the Concert kernel. On the retrieved abstract SatelliteData 
Harmony now executes the user-defined window() method. This method extracts 
the part of the image we want to display. In order to do so, it falls back on 
the concept-typical methods SPLIT() and COMPOSE() of the SPATIAL component 
of each SatelliteData. These methods are provided by the kernel module 
SatelliteImage. They retrieve the needed parts of the satellite image from 
the external tape archive. This is an example of interoperability at the storage 
system layer. 

Third and last, there might exist a web page on which weather data about 
Europe is published. Now we are interested in all satellite images of Europe 
that were taken at the same time as this published weather information. 
This means we need interoperability at the object manager layer. Harmony has 
to access the external web server via a corresponding wrapper. This wrapper 
transforms the contents of the page into a collection of weather data entries. 
Harmony afterwards joins these entries with the image objects stored inside 
Concert according to the time attributes. 

3 An Abstract-Object Kernel System 

Under the term “Object-Relational”, database systems that allow extensions to 
their kernel system have become popular. Such extensions are called blades, 
cartridges, or the like. They allow the DBMS to be extended by 
application-specific types and access methods. While implementing new types is 
relatively easy, implementing new access methods is not. A new access method has 
to cooperate with the various components of the DBMS, such as concurrency 
control, data allocation, query evaluation, and optimization. This requires 
substantial knowledge of the DBMS internals. In contrast, the Concert kernel 
offers a limited, built-in set of physical design mechanisms in the form of 
generic, trusted DBMS code provided by the DBMS implementor. Physical design is 
performed by relating new types to the fundamental concepts of the built-in 
physical design mechanisms. 



3.1 Concepts of Physical Database Design 

In [17], Stonebraker introduced the idea of a generic B-Tree that depends only 
on the existence of an ordering operation to index arbitrary data objects. Our 
Concert approach generalizes this idea by identifying all relevant concepts of 
physical database design and expressing them by the so called concept typical 
operations required to implement them over external data. The data objects are 
treated as abstract data types (ADT) in Concert, and physical database design 
is performed based on the operations of the ADT only. These ADTs are 
user-defined and their methods are dynamically linked to the kernel at run time. In 
order to implement search tree access methods, a generic search tree approach 
(similar to GiST [6]) can be used as it integrates nicely into the Concert 
framework. 

The physical design concept behind Stonebraker’s generic B-Tree is that of 
data objects having a scalar property. Therefore, we call it the SCALAR concept 
and its concept typical ordering operation COMPARE. The comparison operation 
is sufficient to instantiate a generic B-Tree index without any further knowledge 
of the data objects. A second concept, the RECORD concept, allows components 
of objects to be identified. A data object might be decomposed into object parts. 
This is exploited for example in a relational context as vertical partitioning. 
Its concept typical operation is the decomposition of objects into object parts 
called SUB_OBJECT. A third fundamental concept of physical database design is 
the one found for example in the information retrieval context, where objects 
are organized according to sets of object properties such as the index terms of 
the document object. The concept typical operations of this concept are the 
ones iterating over the set of properties. We therefore call it the LIST concept. 
The iteration allows the properties to be entered for example into an inverted 
file index. Finally, the last concept of physical database design that we 
identified in Concert is concerned with spatially extended objects and is 
therefore called the SPATIAL concept. It is used for expressing space-subspace 
relationships as they appear in GIS and CAD systems, but also in temporal 
applications. 

These concepts are the means for interoperability at the Concert kernel 
level as discussed in Section 2. From a physical design point of view, storing the 
satellite images of our sample application in a tape archive while storing the 
corresponding metadata in the Concert storage component corresponds to a 
vertical partition of the image objects. The concept typical operation required is 
the decomposition of the RECORD concept. Therefore, the application programmer 
can make the fact that the image resides in the tape archive known to the 
kernel via the RECORD concept by implementing the concept typical operation 
SUB_OBJECT. This operation is responsible for accessing the image on the archive. 
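
As an illustration of the SCALAR concept, the following Python sketch builds an 
ordered index that knows nothing about its keys except a user-supplied COMPARE 
operation. A sorted list stands in for the generic B-tree, and the names 
GenericOrderedIndex and compare_dates are hypothetical; the Concert kernel 
itself links such operations in as object code. 

```python
import bisect
from functools import cmp_to_key
from typing import Callable, List


class GenericOrderedIndex:
    """Ordered index that relies only on COMPARE(o1, o2) -> {-1, 0, 1}."""

    def __init__(self, compare: Callable[[object, object], int]):
        self._key = cmp_to_key(compare)   # wraps COMPARE so bisect can use it
        self._keys: List[object] = []     # wrapped keys, kept sorted
        self._objects: List[object] = []  # indexed objects, same order

    def insert(self, obj: object) -> None:
        k = self._key(obj)
        pos = bisect.bisect_right(self._keys, k)
        self._keys.insert(pos, k)
        self._objects.insert(pos, obj)

    def range(self, low: object, high: object) -> List[object]:
        # All objects o with COMPARE(low, o) <= 0 and COMPARE(o, high) <= 0.
        lo = bisect.bisect_left(self._keys, self._key(low))
        hi = bisect.bisect_right(self._keys, self._key(high))
        return self._objects[lo:hi]


def compare_dates(a, b) -> int:
    # Hypothetical user-defined type: dates stored as (day, month, year).
    ka, kb = (a[2], a[1], a[0]), (b[2], b[1], b[0])
    return (ka > kb) - (ka < kb)


index = GenericOrderedIndex(compare_dates)
for d in [(10, 3, 1999), (1, 12, 1997), (12, 3, 1999), (5, 1, 1998)]:
    index.insert(d)

hits = index.range((1, 1, 1998), (11, 3, 1999))
```

The index code never inspects the layout of a date; replacing compare_dates is 
enough to index a different type, which is the point of relating a type to a 
concept rather than to a predefined type system. 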

In addition to the four concepts SCALAR, RECORD, LIST and SPATIAL and 
their concept typical operations, three fundamental operations are required for 
all abstract objects. They are needed to pass abstract objects across 
system-internal interfaces. If, for example, an abstract object is inserted into 
a B-tree index, the object has to be passed recursively through the nodes of the 
tree. The kernel has to be able to COPY an object; this is the first and most 
important operation that any object in Concert has to provide. Depending on 
the usage of the object, copying can be performed in different ways. If the 
object is to be passed to a function call within the same process, a shallow 
copy might be appropriate. If the object has to be stored in a database disk 
page, a full copy is required. In addition, this copy has to be linearized into 
a single contiguous address space. Concert allows the copy operation to be 
driven by a set of copy flags that make such distinctions. While the copy 
operation is specific to each object type, memory allocation has to be 
performed by the generic database code. To determine the resource 
requirements, each object has to provide the COPY_SIZE operation, which enables 
the generic kernel to perform the necessary allocations. Performing a copy 
operation may additionally require resources such as temporary memory, network 
connections, file handles or the like. These resources have to be freed once 
the copy is no longer needed by the database. Therefore, the third operation 
required for all generic objects is the DELETE_AUX operation. As a consequence, 
abstract objects are usually passed around using the steps shown in Figure 2. 



s := COPY_SIZE (o, copy_flags) ; 
new_o := allocate (s) ; 
COPY (o, new_o, copy_flags) ; 
... do something with new_o ... 
DELETE_AUX (new_o) ; 



Fig. 2. Steps required to move an object around 
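
The steps of Fig. 2 can be sketched in Python as follows. The type VarString, 
the flag FULL_COPY and the helper pass_object are hypothetical, and a bytearray 
stands in for a database page; the sketch only mirrors the division of labour in 
which the object type supplies COPY_SIZE, COPY and DELETE_AUX while the generic 
code performs the allocation. 

```python
FULL_COPY = 1  # hypothetical copy flag: linearized copy for a disk page


class VarString:
    """Hypothetical abstract object type: a variable-length string."""

    def __init__(self, text: str):
        self.text = text
        self.aux_released = False

    def copy_size(self, copy_flags: int) -> int:
        # COPY_SIZE: bytes needed for a linearized copy (4-byte length + payload).
        return 4 + len(self.text.encode("utf-8"))

    def copy(self, target: bytearray, copy_flags: int) -> None:
        # COPY: write the object into memory allocated by the generic code.
        payload = self.text.encode("utf-8")
        target[0:4] = len(payload).to_bytes(4, "big")
        target[4:4 + len(payload)] = payload

    def delete_aux(self) -> None:
        # DELETE_AUX: free auxiliary resources (none here; the call is recorded).
        self.aux_released = True


def pass_object(obj, copy_flags: int = FULL_COPY) -> bytes:
    """Generic code: works for any type offering the three operations."""
    size = obj.copy_size(copy_flags)   # s := COPY_SIZE(o, copy_flags)
    new_o = bytearray(size)            # new_o := allocate(s)
    obj.copy(new_o, copy_flags)        # COPY(o, new_o, copy_flags)
    page = bytes(new_o)                # ... do something with new_o ...
    # Simplification: the raw buffer has no methods, so DELETE_AUX is
    # invoked through the source object instead of the copy.
    obj.delete_aux()
    return page


original = VarString("Zurich")
page = pass_object(original)
```

The generic function never looks inside the object; it only asks for a size, 
allocates, and delegates the copy, which is why it pays a few extra calls 
compared with a hard-coded assignment. 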



The Appendix shows the interface definition of the Concert kernel concepts 
and their concept typical operations. It is beyond the scope of this paper to give 
full details here. More information on Concert concepts in particular and the 
Concert kernel system in general can be found in [2,4,14] and [15]. 

3.2 Performance Evaluations for Interoperability at Kernel Level 

Clearly, the flexibility that allows the Concert kernel to access data from 
remote repositories has its price. This is not specific to Concert but rather 
inherent in any interoperable system. Because interoperability in the kernel is 
done at a very low level of abstraction, it is very efficient, and measuring the 
local overhead therefore gives a lower bound for the overhead to be expected in 
any interoperable system. 

Passing objects as parameters of procedure calls across kernel modules is 
common in kernel systems. Therefore, the main reason for the low-level 
interoperability cost in Concert is the fact that the algorithm in Figure 2, 
which involves method calls to abstract objects, is executed frequently. In 
systems with hard-coded object types, passing them as parameters of procedure 
calls can be specialized for the supported types and can therefore be coded more 
efficiently. Measuring the overhead of the generic algorithm compared with the 
hard-coded one gives a good indication of the low-level interoperability cost. 

We identify three typical cases for base type objects: 

— The object type is a built-in type (such as longint or float) of the 
compiler that the database system is compiled with. Copying the object can be 
done using compiler-generated object assignment code. This is the best possible 
case for a hard-coded system, and the largest advantage over the generic case 
is expected. However, for these types, data independence cannot be guaranteed, 
as the type representation is compiler and hardware dependent. 

— The object type is compiler independent, but of simple structure and of 
fixed size. Most standard database types in traditional database systems 
are of this category, such as INTEGER, NUMBER, CHAR(n). In some systems, 
aggregations of simple types such as ROW types in SQL3 fall into this category 
as well. Because their size is known a priori, and their representation is a 
continuous byte sequence, copying these objects corresponds to a simple 
memcopy operation. 

— The object type is of variable size, such as VARCHAR, BLOB or aggregations of 
object types. Their object size varies from object to object. Therefore, the 
object size has to be determined at run time and space allocation has to be 
performed dynamically. 

We do not discuss object types with complex structure, because there is virtually 
no cost difference between hard-coded and abstract objects. 

Using the Concert kernel system we measured the three typical cases com- 
paring hard-coded with abstract objects using generic algorithms. Figure 3 sum- 
marizes the results showing the copy time in nanoseconds on two different SUN 
Solaris system architectures. 

As expected, moving more complex objects around is much more expensive than 
moving simple, small ones. The more recent system architecture (UltraSparc 1) 
is also substantially faster than the older one. 

                 SparcCenter 2000           UltraSparc 1
                 hard-coded    generic      hard-coded    generic
int (32bit)             446       2108     [illegible]        840
NUMBER                 1345       2877             719       1389
VARCHAR               11363      13199            3539       4196

Fig. 3. Comparison of copying hard-coded objects versus abstract objects using 
generic algorithms (execution time in ns) 

From these measurements, we see that the interoperability cost, that is, the 
difference between the hard-coded version and the generic version, is 
especially high for very simple objects (with an interoperability overhead of 
much more than 100%). The overhead is much smaller for larger objects (the 
VARCHAR object had an average length of 150 characters, resulting in an 
overhead of approximately 15%). For even larger and more complex objects, the 
overhead is only a few percent. While the first case is not very relevant for 
database interoperability due to the lack of data independence, already in the 
second case the overhead is on the order of the improvement of one hardware 
generation. Furthermore, the local interoperability cost for small objects is 
very small compared with the overall system cost. We conclude that building a 
system capable of low-level interoperability using generic instead of 
hard-wired algorithms has only a minimal impact on the overall system 
performance: the rapid hardware development makes low-level interoperability 
affordable. In the next section we concentrate on the higher level aspect of 
interoperability. 
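
The relative overheads quoted above can be recomputed from the copy times of 
Fig. 3, as in the following Python sketch. The overhead is taken to be 
(generic - hard-coded) / hard-coded, which is an assumed reading of the text; 
the hard-coded int value for the UltraSparc 1 is not legible in the source and 
is therefore left out. 

```python
# Copy times in ns from Fig. 3, stored as (hard-coded, generic).
measurements = {
    ("SparcCenter 2000", "int"): (446, 2108),
    ("SparcCenter 2000", "NUMBER"): (1345, 2877),
    ("SparcCenter 2000", "VARCHAR"): (11363, 13199),
    ("UltraSparc 1", "NUMBER"): (719, 1389),
    ("UltraSparc 1", "VARCHAR"): (3539, 4196),
}


def overhead_percent(hard_coded: float, generic: float) -> float:
    """Extra cost of the generic algorithm relative to the hard-coded one."""
    return 100.0 * (generic - hard_coded) / hard_coded


overheads = {
    key: round(overhead_percent(hard, gen), 1)
    for key, (hard, gen) in measurements.items()
}

for (machine, type_name), pct in overheads.items():
    print(f"{machine:17s} {type_name:8s} {pct:6.1f} %")
```

With these figures the overhead is roughly 370% for int and 110% for NUMBER on 
the SparcCenter 2000, and between 16% and 19% for VARCHAR on both machines. 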



4 An Abstract-Object Query System 

The usual understanding of interoperability between systems is quite narrow, 
addressing only SQL interoperability. Commercial “SQL middleware” such as the 
Informix Enterprise Gateway Manager, Oracle Transparent Gateways, or Sybase’s 
OmniSQL Server relies on a declarative surface of object managers. That is, 
such products presume a declarative interface like ODBC or JDBC that is already 
capable of executing SQL, a precondition that is not feasible for non-database 
systems. Therefore, we will not discuss this third level of interoperability 
in this paper. 

We rather address interoperability of the layer between the kernel system 
and the (declarative) user interface. Concert’s object manager only relies on 
single-scan collection interfaces. In particular, it does not require any further query 
capabilities of the storage system. However, if certain repositories provide such 
capabilities, they can be exploited. The object manager itself adds complete 
query functionality to the underlying storage kernel and exports a declarative 
interface to subsequent system layers. 

An additional important feature of Concert’s object manager is its 
distributed peer-to-peer query execution. While the kernel offers typical storage 
system functionality for a single-node system, the object manager layer is 
capable of distributing its query execution among several sites with Concert 
instances. The CORBA middleware standard [10] is used as the underlying 
infrastructure. CORBA specifies a communication infrastructure for arbitrary 
heterogeneous components of distributed systems. Furthermore, it defines a set 
of basic system services and standard components. One system service is of 
particular interest for Concert: the Object Query Service (OQS). As the core of 
Concert’s object manager, we have designed and realized an implementation of 
the OQS, called Harmony. 



4.1 The Harmony Query Service 

Harmony’s query algebra operators are implemented by means of physical query 
operations. E.g., the join operator might actually be executed as a simple 
nested-loop join, or by more sophisticated algorithms such as a merge join or 
hash join. Harmony evaluates a query by a sequence of such physical query 
operations which are partially ordered: an execution plan of the query. An 
execution plan has a tree structure, each node representing one physical 
operation. Edges connect the nodes according to their partial ordering. Each 
edge therefore represents the dataflow between subsequent operations. The leaves 
stand for the data sources needed for query evaluation. The root node represents 
the result of the query. 

Fig. 4. Example query and corresponding execution plan in Harmony. 



An efficient implementation of the OQS is not straightforward. To come up 
with a realization of a CORBA-based query service whose performance is 
competitive, we have deployed classical database concepts, notably dataflow 
evaluation, bulk transfer and intra-query parallelism. Internally, Harmony 
evaluates a query in a dataflow manner according to an execution plan. As 
introduced in Section 2, each operation of the plan corresponds to a 
QueryableCollection object instantiated before evaluating the query. These 
objects actually implement the query algebra operators of Harmony. In addition 
to the usual operators, the query algebra includes meta operators: wrap, send 
and receive. Meta operators do not change the data stream, but perform some 
control function. 
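
Dataflow evaluation of a tree-shaped execution plan can be sketched in Python 
with iterators: each node pulls rows from its children, the leaves are data 
sources, and the root yields the query result. The classes Scan, Select and 
NestedLoopJoin and the sample rows are hypothetical and are not Harmony's 
operators; the join condition on time mirrors the weather example of Section 2. 

```python
from typing import Callable, Iterable, Iterator


class Scan:
    """Leaf node: a data source that can be scanned repeatedly."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def __iter__(self) -> Iterator[dict]:
        return iter(self.rows)


class Select:
    """Inner node: passes on the rows of its child that satisfy a predicate."""

    def __init__(self, child, predicate: Callable[[dict], bool]):
        self.child, self.predicate = child, predicate

    def __iter__(self) -> Iterator[dict]:
        return (row for row in self.child if self.predicate(row))


class NestedLoopJoin:
    """Inner node: simple nested-loop join of two child data streams."""

    def __init__(self, left, right, condition: Callable[[dict, dict], bool]):
        self.left, self.right, self.condition = left, right, condition

    def __iter__(self) -> Iterator[dict]:
        for l in self.left:
            for r in self.right:          # right child is rescanned per left row
                if self.condition(l, r):
                    yield {**l, **r}


images = Scan([{"title": "Alps", "date": 1}, {"title": "Rhine", "date": 2}])
weather = Scan([
    {"when": 1, "rainfall": 0},
    {"when": 2, "rainfall": 7},
    {"when": 3, "rainfall": 2},
])

# Root of the plan: select(join(images, weather)).
plan = Select(
    NestedLoopJoin(images, weather, lambda i, w: i["date"] == w["when"]),
    lambda row: row["rainfall"] > 0,
)
rainy_titles = [row["title"] for row in plan]
```

Iterating over the root drives the whole plan; no intermediate result is 
materialized, and a different join algorithm could replace NestedLoopJoin 
without changing the other nodes. 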

Send / Receive The send/receive operators model the asynchronous set-oriented 
data transfer between sub-plans. They are used for performance optimizations 
of distributed query processing and of interoperating queries, as we will show in 
Subsection 4.3. For a more detailed description and evaluation of send/receive 
see [13]. 

Wrap The wrap operator abstracts from the underlying storage systems and 
therefore allows different storage systems to be integrated at the logical 
level. In order to interoperate with a certain external system, an instance of 
the wrap operator for this storage system must be created. These instances 
correspond to the concept of wrappers [5]. Available wrappers are configured via 
Harmony’s global schema (cf. Appendix). Their code is dynamically loaded into 
Harmony at runtime as soon as the first access to a collection at the 
corresponding repository is required. 

A Harmony wrapper transforms the interface of the underlying storage system 
into the collection/iterator interface needed by Harmony for query evaluation. 
This includes type conversion into the CORBA type system. As Harmony provides 
all further query operations, a simple wrapper does not need to implement any 
query functionality of its own. The only functionality a wrapper must provide is 
access to the data collections of its source. In the simplest case, this means 
the ability to scan the source. A wrapper of a more sophisticated system may 
publish its query capabilities. Harmony can then decide to delegate whole 
sub-queries to such a wrapper, which translates them into the local query 
language and executes them there. 
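
The two kinds of wrapper can be sketched in Python as follows. All names 
(LineFileWrapper, KeyedStoreWrapper, the capabilities set and select_equal) are 
hypothetical and only illustrate the idea: a simple wrapper offers a scan, a 
more capable one additionally publishes a query capability to which the query 
layer may delegate. 

```python
import io
from typing import Iterator, List


class LineFileWrapper:
    """Scan-only wrapper: exposes a ';'-separated text source as records."""

    capabilities = frozenset()  # publishes no query capabilities

    def __init__(self, source: io.TextIOBase, fields: List[str]):
        self.source, self.fields = source, fields

    def scan(self) -> Iterator[dict]:
        self.source.seek(0)
        for line in self.source:
            yield dict(zip(self.fields, line.rstrip("\n").split(";")))


class KeyedStoreWrapper:
    """Wrapper of a more capable source that can evaluate equality selections."""

    capabilities = frozenset({"select_equal"})

    def __init__(self, rows: List[dict]):
        self.rows = list(rows)
        self.delegated = 0  # counts sub-queries executed by the source itself

    def scan(self) -> Iterator[dict]:
        return iter(self.rows)

    def select_equal(self, field: str, value: object) -> Iterator[dict]:
        self.delegated += 1
        return (r for r in self.rows if r[field] == value)


def select_equal(wrapper, field: str, value: object) -> List[dict]:
    """Query layer: delegate if the wrapper can do it, otherwise scan and filter."""
    if "select_equal" in wrapper.capabilities:
        return list(wrapper.select_equal(field, value))
    return [r for r in wrapper.scan() if r[field] == value]


files = LineFileWrapper(io.StringIO("Alps;1998\nZurich;1999\n"), ["title", "year"])
store = KeyedStoreWrapper([{"title": "Alps", "year": "1998"}])

from_file = select_equal(files, "year", "1999")
from_store = select_equal(store, "year", "1998")
```

Both sources answer the same query through the same function; only the second 
one does part of the work itself. 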

We distinguish three different types of repositories which can be accessed by 
Harmony: 

— The Concert kernel storage system. This is the internal interface to Con- 
cert’s own kernel system. The major aspect here is the integration of the 
kernel’s abstract object types into the CORBA-based query service. This 
affects the type system and the user-defined methods, as we indicated in the 
satellite image example before. A further discussion follows in Subsection 4.2. 

— A data repository. As mentioned in the introduction, the main idea of 
Concert is to export database functionality to external systems. At the level of 
the object manager this is possible by providing a wrapper for such a simple 
repository, e.g., a GIS or a file system. Via the wrapper, Concert 
is capable of evaluating declarative queries on data managed by such 
repositories. Furthermore, externally stored data can be combined (joined) with 
objects managed by Concert itself (as shown with the third example query 
of Section 2). A similar approach is taken by Microsoft with OLE DB [3,12], 
but only for relational data and without a storage system of its own. 

— A database system. A special case of interoperability with external storage 
systems is the integration of full-fledged database systems. First, such systems 
have their own query capabilities, which should be exploited by the corresponding 
wrapper. Second, the interface level at which Harmony now interoperates 
with the external system is typically one above the layer of the object 
manager. E.g., with a relational database, the wrapper must use the Embedded 
SQL API, which is already a declarative interface. As state-of-the-art database 
systems do not provide lower-level APIs, interoperability is simply not possible 
at a lower abstraction layer. 

4.2 Abstract-Object Types in Harmony 

As Harmony relies on the CORBA standard, it has to type the data with respect 
to the CORBA type system. Here, we have to distinguish two different abstrac- 
tion levels of data access in Harmony. So far, we concentrated on the higher level 
of collections of objects. All data access and query capabilities of Harmony are 
defined with respect to abstract (queryable) collection and iterator interfaces. 
These are CORBA objects, defined in CORBA IDL and fully embedded in the 
middleware infrastructure. 

At the abstraction level below, we are interested in the member objects of 
Harmony’s queryable collections. As they may be of arbitrary type, Harmony 
employs the dynamic typing capability of CORBA, its any type. The actual content 
type of an any value is determined at runtime. For example, the data produced by 
the cordbms node in Figure 4 corresponds to tuples of the form (Description 
VARCHAR(100)). The wrap operation maps these data items into the CORBA 
type sequence<any>. At runtime, it contains values of type string. 

In order to benefit from the abstract object types of the kernel, we extended 
CORBA’s own dynamic type to include Concert’s built-in types. For the 
CORBA system this is a user-defined opaque type, for which Harmony provides the 
code. This means that Harmony can also exploit the capabilities of concept 
typical operations. In particular, this allows interoperability at the physical 
and the logical level to be combined. The Harmony object manager can evaluate a 
query over collections of Concert objects which are actually stored outside the 
kernel and accessed by the concept typical operations. 



4.3 Performance Evaluations for Interoperability at Object 
Manager Level 

In this subsection we are interested in quantifying the costs of interoperability. 
Clearly, any access through the wrapper interface of Harmony must be more 
expensive than direct access to the underlying storage system. Interoperability 
does not come for free. 

Besides the performance difference between a three-tier (client - Harmony - 
repository) and a two-tier (client - repository) approach, another major 
performance bottleneck must be considered: as mentioned before, the Harmony 
object manager is capable of distributed execution of a query by exploiting a 
CORBA infrastructure. Like any middleware system, CORBA introduces some 
overhead in order to provide the location and implementation transparency one 
expects from such a system. E.g., it has to marshal/demarshal each value 
transmitted via the middleware. These additional data conversions certainly 
affect the performance of a middleware-based system. 

Harmony’s answer to this issue is its send/receive meta operators 
introduced above. These operators allow for intra-query parallelism and 
optimized bulk-data transfer between different query operations. While the receive 
operator consumes the set of intermediate results it got from the corresponding 
send operator, the subsequent query operators can produce the next partial result 
in parallel. For further details see [13]. 
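
The combination of bulk transfer and intra-query parallelism can be sketched in 
Python with a producer thread and a bounded queue. This is only an analogy under 
stated assumptions: threads and queue.Queue stand in for the asynchronous CORBA 
transfer, and the functions send, receive and run are hypothetical names, not 
Harmony's operators. 

```python
import queue
import threading
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking the end of the data stream


def send(source: Iterable, channel: queue.Queue, chunk_size: int) -> None:
    """Producer side: ships results in chunks instead of item by item."""
    chunk: List = []
    for item in source:
        chunk.append(item)
        if len(chunk) == chunk_size:
            channel.put(chunk)
            chunk = []
    if chunk:
        channel.put(chunk)       # last, possibly smaller chunk
    channel.put(_END)


def receive(channel: queue.Queue) -> Iterator:
    """Consumer side: yields items while the producer keeps working."""
    while True:
        chunk = channel.get()
        if chunk is _END:
            return
        yield from chunk


def run(source: Iterable[str], chunk_size: int = 100) -> List[str]:
    channel: queue.Queue = queue.Queue(maxsize=4)   # bounded buffer between sub-plans
    producer = threading.Thread(target=send, args=(source, channel, chunk_size))
    producer.start()
    # Further processing overlaps with the producer filling the next chunks.
    result = [item.upper() for item in receive(channel)]
    producer.join()
    return result


values = run([f"v{i}" for i in range(250)], chunk_size=100)
```

The chunk size plays the role of the 100-value array fetch used in the 
experiments: larger chunks mean fewer transfers, while the bounded queue keeps 
producer and consumer working concurrently. 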

As a starting point to quantify the interoperability overhead of Harmony, 
we measured the access times for data access to a result set of 30000 string 
values stored in a relational database. We chose a database system, so that we 
could easily obtain a reference result. Therefore, we executed the SQL query 
shown in Figure 4 via an Embedded SQL/C program. The query was executed 
in two different system configurations: with the client and server on the same 
local machine and separated via a local area network (LAN). The results are 
compared to the execution times of Harmony. First, we submitted the same 
query to the RDBMS via Harmony’s RDBMS wrapper and retrieved the result 
set. Second, we introduced a send/receive pair above the wrap operation, so 
that the external database produced its result items concurrently with the further 
processing in Harmony. For our experiments, the Harmony client retrieved the 
query results in chunks of 100 values. This is the same number of items as 
the ESQL/C program retrieved with one of its array fetches. The runtimes are 
presented in Figure 5. 





           RDBMS      Harmony
                      wrap only    with send/receive
local      26.22s     391.63s      32.04s
LAN        24.19s     366.45s      29.98s



Fig. 5. Local/remote data access times (seconds) for Concert and an RDBMS. 



This experiment clearly shows a massive interoperability overhead for the 
“naive” execution of the query with Harmony. Without deploying any query 
optimization techniques, data access through the Harmony wrappers is about 15 
times slower than direct access to the RDBMS (first two columns in Figure 5). 
This changes drastically if we introduce intra-query parallelism in Harmony: 
data access through Harmony is now only about 20% slower than direct database 
access via Embedded SQL/C. This result shows that medium-level interoperability 
is affordable, as long as it is combined with proven query optimization 
techniques. 
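
The factors quoted above follow directly from the access times in Fig. 5, as 
the following Python sketch shows. The dictionary keys are hypothetical labels; 
the numbers are those of the figure. 

```python
# Access times in seconds from Fig. 5.
times = {
    "local": {"rdbms": 26.22, "wrap_only": 391.63, "send_receive": 32.04},
    "LAN": {"rdbms": 24.19, "wrap_only": 366.45, "send_receive": 29.98},
}


def slowdown_factor(t: float, baseline: float) -> float:
    """How many times slower t is than the baseline."""
    return t / baseline


def overhead_percent(t: float, baseline: float) -> float:
    """Extra time relative to the baseline, in percent."""
    return 100.0 * (t - baseline) / baseline


summary = {
    config: (
        round(slowdown_factor(t["wrap_only"], t["rdbms"]), 1),
        round(overhead_percent(t["send_receive"], t["rdbms"]), 1),
    )
    for config, t in times.items()
}

for config, (factor, pct) in summary.items():
    print(f"{config:5s} wrap only: {factor:4.1f}x   with send/receive: +{pct:4.1f} %")
```

The wrap-only configuration is about 15 times slower in both settings, and the 
send/receive configuration is between 22% and 24% slower than direct access, 
consistent with the rounded figures given in the text. 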

5 Conclusions 

In this paper, we showed how interoperability between heterogeneous storage 
systems is achieved in our Concert system. We discussed two different levels 
of interoperability: between kernel systems, and on the object manager level. 
The main motivation is to be able to export database functionality to external 
systems. 

This is achieved in a very flexible way: the kernel allows modules with 
user-defined types to be plugged in; these only have to provide the fundamental 
concept-typical operations. Above the kernel, the Harmony object manager 
includes a wrap operator in its query algebra. Instances of this operator, 
so-called wrappers, allow interoperation with external repositories at the 
query-processing level. Their only required functionality is a simple scan 
interface. Extensions at all levels are implemented in a plug-and-play fashion, 
exploiting the operating system’s capability for dynamic linking. 

We also presented first results of a cost evaluation of interoperability with 
Concert at the different levels. From our experiments, we can draw two 
conclusions: First, the interoperability overhead increases with the abstraction 
layer at which interoperability takes place. Second, the overhead can be minimized by 
exploiting well-known database optimization techniques, like bulk-transfer and 
intra-query parallelism in the case of Harmony. Naive interoperability is very 
costly. An efficient implementation of an interoperability interface needs careful 
design and conscious deployment of proven database techniques. 

The ideas presented here are part of ongoing work. We are primarily 
interested in further evaluating the interoperability costs at the different 
abstraction levels, especially in comparison with spatial data access modules of 
commercial systems. We also plan to investigate how external data can be 
accessed fully dynamically via wrappers without prescribing the data format. 

Interface Definition of the Kernel Concepts 



CONCEPT UNKNOWN 
    COPY_SIZE     object, copy_flags 
    COPY          source, target, copy_flags 
    DELETE_AUX    object 



CONCEPT SCALAR ISA UNKNOWN 
    COMPARE       o1, o2    -> {-1,0,1} 



CONCEPT RECORD ISA UNKNOWN 
    SUB_OBJECT_SIZE    object, component    -> ℕ 
    SUB_OBJECT         object, component    -> part 



CONCEPT SPATIAL ISA UNKNOWN 
    OVERLAPS    object1, object2 
    SPLIT       object        -> { object } 
    COMPOSE     { object }    -> object 
    APPROX      { object }    -> object 



CONCEPT LIST ISA UNKNOWN 
    OPEN     object    -> cursor 
    FETCH    cursor    -> element 
    CLOSE    cursor 



Example Kernel Modules 



// "StringType" and "DateType" are predefined as SCALAR and 
// SPATIAL concepts 

CREATE CONCEPT SatelliteImage AS 
    SPATIAL 
    WITH OVERLAPS := SI_Overlaps.so 
         SPLIT    := SI_Split.so 
         COMPOSE  := SI_Compose.so 
         APPROX   := SI_Approx.so 

CREATE CONCEPT SatelliteData AS 
    RECORD ( StringType, DateType, SatelliteImage ) 
    WITH SUB_OBJECT_SIZE := SD_GetSubObjSize.so 
         SUB_OBJECT      := SD_GetSubObj.so 



Example Global Schema 

repository WeatherServer := WebPageWrapper.so 
    { URL := "http://www.weathernet.com/..." }; 

// object types "Image", "Coordinate" and "Time" are predefined 
class SatelliteImage ( extent SatelliteImages ) 
    attribute string title; 
    attribute Time date; 
    attribute Image picture; 

    Image window ( x, y, w, h : integer ) := SI_window.so; 
    boolean is_over ( pos : Coordinate ) := SI_translate.so; 

class WeatherMeasurement@WeatherServer ( extent Measurements ) 
    attribute Time when; 
    attribute Coordinate where; 
    attribute int rainfall; 
    attribute int airPressure; 



Abstract. Topological relations are not well defined for raster 
representations. In particular, the widely used classification of topological 
relations based on the nine-intersection [8,5] cannot be applied to raster 
representations [9]. But a raster representation can be completed with 
edges and corners [14] to become a cell complex with the usual topological 
relations [16]. Although it is fascinating to abolish some conceptual 
differences between vector and raster, such a model appeared to be of 
theoretical interest only. 

In this paper definitions for topological relations on a raster - using the 
extended model - are given and systematically transformed to functions 
which can be applied to a regular raster representation. The extended 
model is used only as a concept; it need not be stored. It thus becomes 
possible to determine the topological relation between two regions, given 
in raster representation, with the same reasoning as in vector 
representations. This contributes to the merging of raster and vector operations. It 
demonstrates how the same conceptual operations can be used for both 
representations, thus hiding in one more instance the difference between 
them. 



1 Introduction 

Topological relations are not well defined for raster representations. In 
particular, the widely used classification of topological relations based on the 
four- and nine-intersection [8,5] cannot be applied to raster representations 
[9]. This is due to the topological incompleteness of a raster: it consists, in 
the field view, of (open) two-dimensional cells only. In contrast, vector 
representations also contain one- and zero-dimensional elements, used for the 
representation of boundaries, which close two-dimensional point sets and 
demarcate them from their exterior. Boundary constructions in raster 
representations require the use of raster elements [13], although they are 
two-dimensional by nature. Two-dimensional boundaries contradict topology, so 
they cause some well-known paradoxes. 

Kovalevsky has suggested that the raster can be completed with edges and
nodes to become a full topological model [14]. In this representation, called
here a hybrid raster, topological relations are defined equivalently to a vector
representation [16]. But the hybrid raster appeared to be of theoretical
interest only, mostly due to its additional and redundant memory requirements
(see Section 2.3).

Here detailed definitions for topological relations on a raster, using the
hybrid raster representation, are given and then systematically transformed
to yield functions which can be used in a convolution operation applied to a
regular raster representation. The hybrid raster is used only as a concept.
It need not be stored and is only partially constructed while a topological
relation is being determined. It thus becomes possible to determine the
topological relationship of two regions, given in raster representation, by the
four- or nine-intersection.

A formal approach is used to understand the structure and the theory of an
extended raster representation and its application for topological relations. The
specification is written in a functional language. Pure functional languages [1],
like Gofer [11], provide a useful separation of specification and implementation
[10]. With executable specifications, the result is provable code, in syntax as
well as with test cases, with clear semantics. Furthermore, such a specification
is a basis for iterative optimization; e.g. the Gofer code published here^1 was
optimized in several cycles of improvements. The value of such formal
specifications is recognized more and more. For example, Dorenbeck and
Egenhofer presented a formal specification of raster overlay, with a
generalization for polygons [3]. We also specify an overlay, but of an extended
raster, deriving the same behavior of raster and vector representations.

This contributes to the merging of raster and vector operations. It demon- 
strates how the same conceptual operations can be used for both representations, 
thus hiding in one more instance the difference between them. 

The paper is structured as follows. In Section 2 previous work is collected, 
regarding topological relations between regions, and hybrid raster representation. 
In Section 3 the raster representation is extended to a hybrid raster, and the 
combination of two raster images is presented to determine a four-intersection. 
It is also discussed how to optimize computations. An example in Section 4 shows 
the advantage of an executable specification. Finally a discussion sums up the 
results and perspectives (Section 5). 

2 Previous Work 

2.1 Topological Relations 

Egenhofer proposed a representation of topological relations between point sets,
based on the intersection sets of such point sets [6,7,5]. Point sets in ℝ² refer
to Euclidean topology, with the Euclidean distance as a metric. The metric is
needed to define a boundary of (open) sets. Distinguishing the interior X°, the
boundary ∂X and the exterior X^c of a point set X, two point sets A and B may
have nine intersection sets, which form a partition of the plane. For describing
topological properties the size of the intersection sets is irrelevant; the only
characteristic is whether each set is empty or not.

^1 The complete code is available at our web-page.

For regular closed and singular connected sets (simple regions) even four
intersection sets are sufficient, because the omitted five intersection sets do
not vary. The sets can be ordered in a 2 x 2-array, the four-intersection I_4:

\[
I_4 = \begin{pmatrix}
A^\circ \cap B^\circ & A^\circ \cap \partial B \\
\partial A \cap B^\circ & \partial A \cap \partial B
\end{pmatrix}
\qquad (1)
\]

The nine-intersection contains the other five sets, too. Eight relationships
between two simple regions can be characterized using this schema (Table 1).



Table 1. The eight distinct four-intersections for simple regions, and the names of the 
characterized topological relations. 



(entries in the order of Eq. 1, row by row; ∅ = empty, ¬∅ = not empty)

Relation       A° ∩ B°   A° ∩ ∂B   ∂A ∩ B°   ∂A ∩ ∂B
Disjoint          ∅         ∅         ∅         ∅
Meet              ∅         ∅         ∅        ¬∅
Overlap          ¬∅        ¬∅        ¬∅        ¬∅
Equal            ¬∅         ∅         ∅        ¬∅
Cover            ¬∅        ¬∅         ∅        ¬∅
CoveredBy        ¬∅         ∅        ¬∅        ¬∅
Contain          ¬∅        ¬∅         ∅         ∅
ContainedBy      ¬∅         ∅        ¬∅         ∅



The relationships found were investigated and applied to spatial reasoning
[4,12], with the aim of speeding up spatial queries in GIS or in AI. They are
an important improvement for vector representations, which are based on point
sets in ℝ².



2.2 Topological Relations and Raster Representations 

A raster representation is a two-dimensional array of elements with integer
coordinates. Interpreting the raster elements as fields, instead of lattice
points, the raster is a regular subdivision of space into squares of equal size,
resels (short form for ’raster elements’). For the general principle it does not
matter how the raster is implemented (see e.g. [15]). But a comparison to vector
representations directly shows that only (open) two-dimensional elements exist,
and one- and zero-dimensional elements are missing. Boundaries of regions cannot
be defined by infinitesimal balls as in the Euclidean space.

So far this problem has been treated in two ways:

- omitting boundaries, having only regions as open sets, as is done in
  region-based reasoning methods (e.g. [2]);
- defining substitutes for one-dimensional boundaries, using raster elements
  and an arbitrary neighborhood definition [13].

The first solution allows the application of the nine-intersection only for its
two-dimensional intersection sets, i.e. the region interiors and exteriors, which
yields a subset of the relationships in Euclidean space [17]. The resulting
four-intersection must not be confused with the four-intersection defined with
boundaries (Eq. 1).

The second solution generates two-dimensional boundaries, resel chains or
bands, either as interior boundaries or as exterior boundaries. Two-dimensional
boundaries contradict topology, so they cause some well-known paradoxes
[13]. The result of intersecting the sets of interior, boundary and exterior
raster elements depends heavily on the definition of the boundary (interior or
exterior). Even worse, more than the eight four-intersections described in
Table 1 can occur, even for simple regions [9]. These intersections have no
common-sense meaning; they appear as variations of the eight presented
intersections and need special care.



2.3 The Hybrid Raster Representation 

Kovalevsky has suggested that the raster can be completed with edges and nodes
to become a full topological model, more precisely an abstract cell complex
[14]. The only special property of this cell complex is its regular structure
(Figure 1). Generally all elements of a cell complex are called (2D-, 1D-,
0D-)cells, but in the following we will speak of two-dimensional cells
(identical with the resels of the raster), one-dimensional edges and
zero-dimensional nodes. The union of edges and nodes will be called the
skeleton of the cells.




Fig. 1. A (regular shaped) cell complex, replacing a raster element of usual raster 
representations in the hybrid raster: each cell is closed by four edges and four nodes. 



In this representation, topological relations again refer to Euclidean space,
and the four- or nine-intersection can be applied in full accordance with vector
representations [16]. In vector representations these tests are expensive,
requiring polygon intersection. In a hybrid raster representation the tests are
simple to evaluate: two hybrid rasters (of the same resolution, same size and
common origin), labeled by three values for interior, boundary and exterior, are
overlaid by ∧ (equivalent to ∩ in set notation). Then the nine possible
combinations can be accumulated in a histogram. Binarizing the histogram
(= 0, > 0) yields the nine-intersection. In a more sophisticated algorithm one
would consider the dimension of the intersection sets, and reduce the overlay to
the cells of the relevant dimension.
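
The overlay-and-histogram procedure described above can be sketched in Python.
This is only an illustration under stated assumptions: both hybrid rasters are
assumed to be flattened into equally long sequences with one label per cell,
edge or node, and the label values and the function name nine_intersection are
hypothetical, not taken from [16].

```python
from collections import Counter

# Hypothetical labels for the elements of a hybrid raster.
INTERIOR, BOUNDARY, EXTERIOR = "I", "B", "E"
LABELS = (INTERIOR, BOUNDARY, EXTERIOR)


def nine_intersection(labels_a, labels_b):
    """Overlay two equally indexed label sequences and binarize the histogram.

    Returns a 3x3 tuple of Booleans; rows follow A (interior, boundary,
    exterior), columns follow B. True means the intersection set is non-empty.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both hybrid rasters must have the same elements")
    # Histogram of the nine possible label combinations.
    histogram = Counter(zip(labels_a, labels_b))
    # Binarize: a count of zero means the intersection set is empty.
    return tuple(
        tuple(histogram[(la, lb)] > 0 for lb in LABELS) for la in LABELS
    )
```

The sketch visits every element once, so its cost grows linearly with the
number of cells, edges and nodes of the hybrid raster.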

Winter also presented a data structure to store and access the cells and their
skeleton. If the raster is of size n x m, the additional elements in a hybrid
raster are (n + 1) * m horizontal edges, n * (m + 1) vertical edges, and
(n + 1) * (m + 1) nodes: the required memory space is about four times that of
the raster. Another critical point of such data structures is the considerable
amount of index transformations for each access.
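
The element counts can be checked with a few lines of Python. The function
name and the dictionary keys are hypothetical; the formulas are the ones given
in the paragraph above, with n taken as the number of rows and m as the number
of columns.

```python
def hybrid_element_counts(n, m):
    """Count the elements of a hybrid raster built on an n x m raster."""
    cells = n * m
    horizontal_edges = (n + 1) * m
    vertical_edges = n * (m + 1)
    nodes = (n + 1) * (m + 1)
    return {
        "cells": cells,
        "horizontal_edges": horizontal_edges,
        "vertical_edges": vertical_edges,
        "nodes": nodes,
        # The sum equals 4*n*m + 2*n + 2*m + 1.
        "total": cells + horizontal_edges + vertical_edges + nodes,
    }
```

For a 1000 x 1000 raster the total is 4,004,001 elements, i.e. roughly four
times the one million resels.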

However, if the hybrid raster is used only to represent regions, as a raster
does, and no lines or points, then the additional elements of the hybrid raster
become totally redundant to the cells. The skeleton need not be stored
explicitly; instead, dependency rules can be applied, which work locally. This
paper will investigate these ideas, using a functional approach to specify the
semantics of the rules and their application.



3 Topological Relations in a Functional Extended Raster 
Representation 

The determination of the nine-intersection is simple in a raster representation,
if the topologically completed raster is used (Section 2.3). But this does not
seem practical, as the model includes not only the cells, but also the edges
and the nodes; this would quadruple the storage requirement and also make
computation four times longer. We now develop a functional extension of
the raster that virtually fulfills all the conditions of a hybrid raster,
without explicit representation. The functions are specified in Gofer [11].



3.1 Specification of a Hybrid Raster in Natural Language 

The hybrid raster representation can be computed from the regular raster repre- 
sentation, i.e. the necessary information is already contained in the raster, and 
all additional elements are redundant. 

Assume an arbitrary region (without loss of generality let us confine ourselves
to simple regions) given as the set of resels with value ’Region’, while the
background resels have the value ’Empty’. These two values are mapped to the
Boolean values true and false, to allow the regular logical operations.

Cells: Cells are identical to resels.

Vertical edges: A vertical edge belongs to the interior of the region, iff the
adjacent left and right cells are labeled as ’Region’. It belongs to the exterior
of the region, iff the adjacent left and right cells are labeled as ’Empty’. It
belongs to the right boundary, iff the adjacent left cell is ’Region’ and the
right cell is ’Empty’, otherwise it belongs to the left boundary (Figure 2).

Fig. 2. Classification of edges by the two adjacent cells (bright cells are outside of the
region, dark cells are elements of the region). An edge belongs (a) to the exterior or (b)
to the interior, if both raster elements are homogeneous, (c) and (d) to the boundary,
if the values of the raster elements are different.



Horizontal edges: A horizontal edge belongs to the interior of the region, iff
the adjacent upper and lower cells are labeled as ’Region’. It belongs to
the exterior of the region, iff the adjacent upper and lower cells are labeled
as ’Empty’. It belongs to the lower boundary, iff the adjacent upper cell is
’Region’ and the lower cell is ’Empty’, otherwise it belongs to the upper
boundary.

Distinguishing the orientation of the boundary is not necessary in the context
of this paper. But it could become important in other tasks like line following.

Nodes: A node belongs to the interior of the region, iff all four adjacent cells
are labeled as ’Region’. It belongs to the exterior of the region, iff all four
adjacent cells are labeled as ’Empty’. Otherwise the node belongs to the
boundary (Figure 3).




Fig. 3. Classification of nodes by the four adjacent raster elements (bright resels are 
outside of the region, dark resels are elements of the region). A node belongs to the 
exterior, if all resels are outside (a), to the interior, if all resels are element of the region 
(b), and to the boundary if the four resels are not homogenous; the given examples (c, 
d) are not complete. 



To transform the rules into a formal language, each single element must be
identifiable. We define the following index schema:

Cell index: Cells are indexed in the regular way of resels.

Vertical edge index: Vertical edges are indexed with the same index as the
cell to their left.

Horizontal edge index: Horizontal edges are indexed with the same index as
the cell above.

Node index: Nodes are indexed with the same index as the cell left above.

Figure 4 shows that this indexing schema is indeed complete for the Euclidean 
plane and gives for each element of the representation a unique index. However, 
any subset of the plane will miss the edges and nodes at the left and upper border 
by this indexing schema. For that reason it is presumed that the subsets are 
chosen with a border of at least one resel width (’Empty’) around the represented 
region. 




Fig. 4. Indexing schema for edges and nodes.



3.2 Specification of a Hybrid Raster in a Functional Language 

In Gofer an array can be realized as a class of {bounds, [index := value]},
where bounds are the lower and upper limits of the indices, and the remainder is
a list of associations between an index and a value. In the context of this paper
arrays are two-dimensional and rectangular, indices are integer tuples, and the
type of cells is Boolean:

instance Arrays (Int,Int) Bool 

Let us extract an arbitrary 2-by-2 sub-array from a binary raster image, by
applying the class method getSubMat:

get22Mat image i j = getSubMat image ((i, j), (i+1, j+1))

The sub-array contains the four resels (i, j), (i+1, j), (i, j+1), and
(i+1, j+1). In the following, they are referred to as patterns cIJ, cEast,
cSouth, and cSouthEast, cf. Figure 5.

Fig. 5. The names of resels/cells in a window at cell (i, j).

All the following functions map this sub-array onto a Boolean. They represent
the rules of Section 3.1:

cInterior mat22 = cIJ
cExterior mat22 = not (cInterior mat22)
vInterior mat22 = cIJ && cEast
vExterior mat22 = (not cIJ) && (not cEast)
vBoundary mat22 = not (vInterior mat22) && not (vExterior mat22)

and so on for horizontal edges and for nodes. The result confirms or falsifies the
rule name; e.g. if vBoundary returns true then the vertical edge (i,j) belongs to
the boundary. While the resels are binary, the skeleton elements are ternary.

With the functions above the elements of a hybrid raster can be derived from 
a raster on demand at any raster position. That ability allows the construction 
of the hybrid raster on the fly during the overlay of two raster images. No storage 
of the results is specified for the functions. 
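
For readers less familiar with Gofer, the rules of Section 3.1 can also be
sketched in Python, including the horizontal edges and nodes that the listing
above abbreviates. This is not the authors' code: the raster is assumed to be a
list of rows of Booleans, the row index is assumed to grow downwards (so cEast
is (i, j+1) and cSouth is (i+1, j)), and all function names are hypothetical.

```python
INTERIOR, BOUNDARY, EXTERIOR = "interior", "boundary", "exterior"


def window(image, i, j):
    """Return the 2-by-2 window at (i, j) as (cIJ, cEast, cSouth, cSouthEast)."""
    return (image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1])


def _classify(values):
    """Interior if all adjacent cells are 'Region', exterior if none is."""
    if all(values):
        return INTERIOR
    if not any(values):
        return EXTERIOR
    return BOUNDARY


def cell_label(win):
    # A cell is identical to its resel, so it is never a boundary element.
    return _classify(win[:1])


def vertical_edge_label(win):
    # The vertical edge (i, j) lies between cell (i, j) and its east neighbour.
    return _classify(win[:2])


def horizontal_edge_label(win):
    # The horizontal edge (i, j) lies between cell (i, j) and its south neighbour.
    return _classify((win[0], win[2]))


def node_label(win):
    # The node (i, j) is surrounded by all four cells of the window.
    return _classify(win)
```

Each function looks only at the 2-by-2 window, so every label can be derived
on demand without storing the skeleton.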



3.3 Determination of Topological Relations 

Testing for the intersections between the boundaries and interiors of two simple
regions determines their topological relation. In a hybrid raster, the tests
must be repeated for cells, for horizontal and vertical edges, and for nodes.
Because of the limited dimension of some intersection sets, some of these tests
can be omitted.

For two hybrid raster images (of the same resolution, the same orientation,
and the same origin), only cells intersect with cells, only edges intersect with
edges, and only nodes intersect with nodes. That is a consequence of the regular
decomposition of the plane, and goes beyond the usual properties of vector
representations. Taking advantage of these properties, the four intersection
sets of Equation 1 can be reformulated as four functions, ii_intersect,
ib_intersect, bi_intersect and bb_intersect, one for each intersection set.

The arguments a and b of these functions stand for two sub-arrays, one from
each raster image, with the same index. That means that at any raster position
(i,j) the four intersection sets between two region interiors and boundaries
are determined by compact code:

fourIntersectionIJ a b i j = [ii, ib, bi, bb] where
    ii = ii_intersect (get22Mat a i j) (get22Mat b i j)
    ib = ib_intersect (get22Mat a i j) (get22Mat b i j)
    bi = bi_intersect (get22Mat a i j) (get22Mat b i j)
    bb = bb_intersect (get22Mat a i j) (get22Mat b i j)

The remaining task is to move the 2-by-2 sub-arrays over both rasters in parallel.
So the determination of the four-intersection is reduced to a convolution:

fourIntersection a b = ((map or) . transpose)
    [ fourIntersectionIJ a b i j |
      i <- [begRow .. endRow], j <- [begCol .. endCol] ]

In the code a test is added to guarantee identical image sizes. Also the
patterns begRow and endCol are defined in the code, exploiting the outer band
of ’Empty’ in both images.

Let us consider the last function in more detail. Convolution yields a list of 
four-intersections for each raster position (right hand of the equation), which are 
realized as lists of four Booleans. Transposing this list of lists yields a list of four 
lists each containing all Booleans regarding one intersection set for the whole 
overlaid images. The map operation applies the argument - the or function - to 
all elements of the lists: we derive four Booleans for the global four intersection 
sets. 
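
The whole procedure can be sketched end to end in Python. This is a sketch
under stated assumptions, not the authors' Gofer code: the definitions of
ii_intersect, ib_intersect, bi_intersect and bb_intersect are not reproduced in
the text, so the choice made here (interior/interior on cells only, the other
three sets on edges and nodes) is an assumption, as are the list-of-rows image
layout and all Python names.

```python
def _label(values):
    """'I' if all adjacent cells are Region, 'E' if none is, else 'B'."""
    if all(values):
        return "I"
    return "B" if any(values) else "E"


def _elements(image, i, j):
    """Labels of the cell, vertical edge, horizontal edge and node at (i, j)."""
    c, east = image[i][j], image[i][j + 1]
    south, south_east = image[i + 1][j], image[i + 1][j + 1]
    return (
        _label((c,)),                          # cell
        _label((c, east)),                     # vertical edge
        _label((c, south)),                    # horizontal edge
        _label((c, east, south, south_east)),  # node
    )


def four_intersection_ij(a, b, i, j):
    """Local [ii, ib, bi, bb] at raster position (i, j)."""
    ea, eb = _elements(a, i, j), _elements(b, i, j)
    skeleton = list(zip(ea[1:], eb[1:]))  # edges and nodes only
    ii = ea[0] == "I" and eb[0] == "I"    # cells suffice for interior/interior
    ib = any(x == "I" and y == "B" for x, y in skeleton)
    bi = any(x == "B" and y == "I" for x, y in skeleton)
    bb = any(x == "B" and y == "B" for x, y in skeleton)
    return [ii, ib, bi, bb]


def four_intersection(a, b):
    """Global four-intersection: OR of the local results over all positions."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical sizes")
    local = [
        four_intersection_ij(a, b, i, j)
        for i in range(len(a) - 1)
        for j in range(len(a[0]) - 1)
    ]
    # zip(*local) transposes the list of lists; any() plays the role of 'or'.
    return [any(column) for column in zip(*local)]
```

The loop bounds rely on the outer band of ’Empty’ resels: the last row and
column are only ever read as neighbours, never as window origins.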

Extension of the procedure to the nine-intersection is straightforward.



3.4 Computational Improvements 

In functional languages, the optimization is easily performed - but it is not even 
necessary. Languages like Gofer are ’lazy’; they evaluate functions only when 
needed, and only to a degree that is needed. While lazy evaluation optimizes 
program execution of the Gofer interpreter, the effects must be made explicit for 
translation to standard programming languages. 

The given Gofer code is already partly optimized: consider the limitation of
evaluating intersection sets with hybrid elements of specific dimensions only.
For example, ii_intersect evaluates only cells, no edges or nodes. That is
sufficient because if the interior-interior intersection set is not empty it
must contain two-dimensional elements. Still open for optimization is the last
function, fourIntersection. The or, mapped to a list of Booleans, is true if at
least one element is true. In principle it is sufficient to stop the evaluation
of each intersection set when the first true result is found.
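
In a strict language this early termination has to be written out. The
following Python sketch shows one possible way; the function name lazy_or and
its interface (a list of predicates, one per intersection set, and an iterable
of raster positions) are hypothetical and only illustrate the idea.

```python
def lazy_or(predicates, positions):
    """OR each predicate over the positions, evaluating as little as possible.

    A predicate is no longer called once it has returned True, and the scan
    over the positions stops as soon as every predicate has succeeded.
    """
    pending = set(range(len(predicates)))
    result = [False] * len(predicates)
    for position in positions:
        for k in list(pending):  # copy, because 'pending' shrinks in the loop
            if predicates[k](position):
                result[k] = True
                pending.discard(k)
        if not pending:
            break
    return result
```

If the positions are produced by a generator, positions after the break are
never computed, which mimics the effect of lazy evaluation.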

Once optimization is done (and tested), the code can be translated into
standard programming languages, like Pascal or C++.



4 Examples

Because Gofer is an executable (interpreted) language, one can run the code 
with some test cases. To generate such examples, first a constructor is called to 
deliver a raster image, initialized as ’Empty’: 

imgEmpty = binArray (-1) (-1) 2 3 False 

Note that the bounds yield a 4-by-5 array, where the usable indices 0...1 in
the first dimension and 0...2 in the second guarantee the outer band of ’Empty’
resels. With the same constructor two rectangular regions are now created. Each
region is combined with the empty image, creating the two raster images imgA
and imgB (Figure 6):

boxA = binArray 0 0 1 0 True
boxB = binArray 0 1 0 2 True
imgA = imgEmpty // assocs boxA
imgB = imgEmpty // assocs boxB

More complex regions could be generated iteratively. Now we can formulate the 
query: 

? fourIntersection imgA imgB

The result is: [False, False, False, True]. That means the only intersection
set of Equation 1 (here in linear order) that is not empty is the intersection
between the two (one-dimensional!) boundaries. The topological relationship
between regions A and B must therefore be meet.
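
Reading the relation name off such a result amounts to a table lookup. The
Python sketch below encodes the eight four-intersections of Table 1 in the
linear order [ii, ib, bi, bb] used above; the dictionary, the function name and
the fallback string are hypothetical.

```python
# Keys are (ii, ib, bi, bb); True means the intersection set is not empty.
RELATIONS = {
    (False, False, False, False): "Disjoint",
    (False, False, False, True): "Meet",
    (True, True, True, True): "Overlap",
    (True, False, False, True): "Equal",
    (True, True, False, True): "Cover",
    (True, False, True, True): "CoveredBy",
    (True, True, False, False): "Contain",
    (True, False, True, False): "ContainedBy",
}


def relation_name(four_intersection):
    """Map a four-intersection to the relation it characterizes."""
    # Patterns outside Table 1 cannot occur for two simple regions.
    return RELATIONS.get(tuple(four_intersection), "not a simple-region relation")
```

Applied to the query result, relation_name([False, False, False, True])
returns "Meet".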



Fig. 6. The regions A (in the left raster image) and B (in the right raster image) meet
along an implicit one-dimensional common boundary.



5 Conclusions 

The systematic and consistent extension of the topological relations, as defined
by Egenhofer, from the vector representation to the raster representation can be
achieved using the conceptual transformation of the raster representation into
the hybrid raster, as a complete topological model. This seems impractical,
but a careful examination shows that no representation of the hybrid raster
needs to be constructed, and the necessary parts can be computed on the fly
from a regular raster representation.

Specifying in a functional language yields a semantically clear piece of code
that can be run with test applications to demonstrate the correctness in the
investigated test cases. The systematic development and the application of
standard methods of program simplification and optimization lead from a
conceptually simple and correct formalization to efficient operations, which
can be coded in various languages. For example, a translation into C++ took
only a few hours including testing. Differences between the Gofer specification
and the C++ implementation concern the conceptual change to an algorithmic
language, and some adaptations to specific efficiency properties of the target
language. It is interesting to compare the codes.

Effects that originate in the resolution of vector-raster conversion are not
investigated in this paper. We do not claim that an operation on a pair of
vector regions results in the same topological relation as when applied to the
rasterized regions. Instead we claim in this paper that the behavior of vector
and raster representation can be assimilated, by extending the raster with its
skeleton. To this extent, the paper contributes to the merging of raster and
vector operations. By using the same conceptual operations in both
representations, the difference between both can be hidden in one more instance.

It is to be expected that in principle the ideas are applicable to quad-trees,
too. But one has to take care of neighboring quad-tree leaves of different size.
The construction of that skeleton is open to formalization. Furthermore,
translation of the given specifications into standard programming languages is
open for further elaboration. Only then can evidence be given for the time
consumption of the algorithms. We expect that the requirements are modest,
because the complexity of the problem is O(n * m) with the number n * m of
raster elements.



Abstract. A Non-Manifold data structure for the modeling of 3D synthetic
environments is proposed. The data structure uses a boundary
representation (B-rep) method. B-rep models 3D objects by describing
them in terms of their bounding entities and by topologically orienting
them in a manner that enables the distinction between the object’s
interior and exterior. Consistent with B-rep, the representational scheme of
the proposed data structure includes both topologic and geometric
information. The topologic information encompasses the adjacencies involved
in 3D manifold and non-manifold objects, and is described using a new,
extended Winged-Edge data structure. This data structure is referred
to as "Non-Manifold 3D Winged-Edge Topology". The time complexity
of the newly introduced data structure is investigated. Additionally, the
Non-Manifold 3D Winged-Edge Topology is being prototyped in a Web-Based
virtual reality application. The prototype data consists of Military
Operation in Urban Terrain (MOUT) data for Camp LeJeune, North
Carolina. The application is expected to be ideal for training and simulation
exercises as well as actual field operations requiring on-site assistance in
urban areas.



1 Introduction 

This paper describes research into the extension of the National Imagery
and Mapping Agency’s (NIMA’s) current Vector Product Format (VPF) [1] by
the Naval Research Lab’s Digital Mapping, Charting, and Geodesy Analysis
Program (DMAP). This work has been carried out with support from the Defense
Modeling and Simulation Office (DMSO) and NIMA’s Terrain Modeling Project
Office (TMPO).

* This work was sponsored by the Defense Modeling and Simulation Office (DMSO)
and the National Imagery and Mapping Agency’s (NIMA) Terrain Modeling Project
Office (TMPO), under Program Element 0603832D, with Jerry Lenczowski and Ron
Magee as program managers. The views and conclusions contained in this paper are
those of the authors and should not be considered as representing those of DMSO
and NIMA.

VPF’s Winged-Edge topology, in its current form, is documented as not
being capable of modeling a wide range of three-dimensional objects that may
be encountered in an integrated three-dimensional synthetic environment. This
class of objects includes non-manifold objects and objects which may be
transmitted and received through the Synthetic Environment Data Representation
and Interchange Specification (SEDRIS) [2]. DMAP therefore proposes VPF+,
an extension to VPF that provides for georelational modeling in 3D (including
non-manifold objects) and which would benefit the Modeling and Simulation
(M&S) community.

2 Non-manifold 3D Winged-Edge Topology 

The data structure relationships of the Non-Manifold 3D Winged-Edge topology
are summarized in the object model of Fig. 1. Figure 1 uses a modified version
of Rumbaugh notation (shown in Fig. 5). References to geometry are omitted.




Fig. 1. Non-Manifold 3D Winged-Edge Topology Object Model. 



2.1 Primitives 

The main VPF+ primitives are:

- Entity node - used to represent isolated features.
- Connected node - used as endpoints to define edges.
- Edge - an arc used to represent linear features or borders of faces.
- Face - a two-dimensional primitive used to represent area features such as a
  lake.

Inside the primitive directory, a mandatory Minimum Bounding Box (MBB) 
table (not shown in Fig. 1 for clarity) will be associated with each edge and 
face primitive. Because of its simple shape, an MBB is easier to handle than its 
corresponding primitive. The primitives shown above have an optional spatial 
index. The spatial index is based on an adaptive grid-based 3D binary tree, which 
reduces searching for a primitive down to binary search. Due to its variable length 
records, the connected node table has a mandatory associated variable length 
index. 

The ring table identifies the ring forming the outer boundary and all internal 
rings of a face primitive. This table (along with the face table) allows the ex- 
traction of all of the edges that form both the outer boundary and the internal 
rings of a face primitive. 

The entity node and the external rings are not essential to the understanding 
of the 3D non-manifold data structure and will not be discussed further. For more 
information, the interested reader is referred to [1,6]. 

The object model of Fig. 1 introduces a new structure called EFaces to resolve 
the ambiguities resulting from the absence of a fixed number of Faces adjacent 
to an Edge. The EFaces structure describes a use of a Face by an Edge and 
allows maintenance of the adjacency relationships between an Edge and zero, 
one, two or more Faces incident to an Edge. Each Connected_Node is related
to one Edge in each manifold object to which the Node is attached and to each 
dangling Edge connected to the Node. As shown in Fig. 1, each Edge is related 
to its start and end Nodes and to its first and last EFaces. Each EFace is related 
to its Face, the Next_EFace in the ordered circular linked list of EFaces that 
the EFace is a member of, and to the Next_Edge_on_EFace. The Face in turn is 
linked to a Ring, which is related to its starting Edge. 
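
The role of the EFaces structure can be illustrated with a small Python sketch.
It is a simplification made for illustration only, not the VPF+ table layout:
only the link from an Edge to its first EFace and the Next_EFace link are
modeled, and all class, field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Face:
    face_id: int


@dataclass(eq=False)
class EFace:
    face: Face
    next_eface: Optional["EFace"] = None  # next use in the circular list


@dataclass(eq=False)
class Edge:
    edge_id: int
    start_node: int
    end_node: int
    first_eface: Optional[EFace] = None  # None for a dangling edge


def attach_faces(edge, faces):
    """Build the circular EFace list of an edge from an ordered list of faces."""
    efaces = [EFace(face) for face in faces]
    # Link each EFace to its successor; the last one points back to the first.
    for current, following in zip(efaces, efaces[1:] + efaces[:1]):
        current.next_eface = following
    edge.first_eface = efaces[0] if efaces else None


def faces_around_edge(edge):
    """Follow the circular list once and return the ids of all incident faces."""
    ids = []
    eface = edge.first_eface
    while eface is not None:
        ids.append(eface.face.face_id)
        eface = eface.next_eface
        if eface is edge.first_eface:
            break
    return ids
```

Because the list is circular and of arbitrary length, the same traversal serves
a dangling edge (no face), a manifold edge (two faces) and a non-manifold edge
(three or more faces).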

2.2 Features 

Traditional VPF defines five categories of cartographic features: Point, Line, 
Area, Complex and Text. Point, Line and Area features are classified as Sim- 
ple Features, composed of only one type of primitive. Each Simple Feature is 
of differing dimensionality: zero, one and two for Point, Line and Area Fea- 
tures respectively. Unlike Simple Features, Complex Features can be of mixed 
dimensionality, and are obtained by combining Features of similar or differing 
dimension. 

The object model of the feature level of VPF+ is shown in Fig. 2. VPF+ adds
a new simple feature class of dimension three. The newly introduced feature,
referred to as 3D Object Feature, is composed solely of Face primitives. This
new feature class is aimed at capturing a wide range of 3D objects. The EFace
Table is also added to the structural scheme. While the Ring Table provides a
relationship between a Face and all the Edges that compose the Face’s Rings,
the EFace Table provides a relationship between an Edge and all the Faces that
meet at that Edge.







Fig. 2. VPF+ Feature Class Structural Schema 



Although 3D Objects are restricted to primitives of one dimension, 3D Ob- 
jects of mixed dimensionality can be modeled through Complex Features using 
Simple Features of similar or mixed dimensionality as building blocks. 

3 Performance Analysis 

This section analyzes the time performance of the non-manifold winged-edge
topology. The evaluation investigates the time complexity of implementing nine
access primitives. These access primitives, listed in Table 1, describe the
retrieval of all topological adjacencies for each of the Face, Node and Edge
primitives. A data structure that stores only a subset of all possible adjacency
relations between primitives, yet can satisfy queries to all nine access
primitives, is said to be topologically sufficient. Our Non-Manifold
Winged-Edge data structure satisfies queries to all nine access primitives [6].



Table 1. The Nine Basic Access Primitives 



Access Primitive (AP)   Description
AP1                     Given face i find all n_i nodes around it
AP2                     Given face i find all e_i edges around it
AP3                     Given face i find f_i faces around it
AP4                     Given node i find f_i faces around it
AP5                     Given node i find all n_i nodes connected to it
AP6                     Given node i find all e_i edges connected to it
AP7                     Given edge i find its two extreme vertices
AP8                     Given edge i find all e_i edges connected to it
AP9                     Given edge i find f_i faces intersecting it



The following notation will be used for the time complexity analysis: 

|APi| = Average time needed to implement access primitive APi.

K = Average time needed to access a row of an existing table using its
primary key.

E_f = Average number of edges around a face.

E_n = Average number of edges around a node.

F_e = Average number of faces adjacent to an edge.

α = Average number of distinct objects connected to a node.

The performance study follows the methodology outlined in [3, 4 and 5]. A 
lower bound on the implementation of any access primitive is K. This is achieved 
when an item is retrieved using the table’s primary key. For instance, given an 
edge id, the retrieval of its two associated entity nodes (access primitive AP7)
requires an amount of time equal to K (referred to herein as constant access time).
This is because each row in the edge table stores an edge id (primary key) and 
pointers to its two connected nodes. Therefore, using the edge table’s primary 
key (edge id), the ids of the two associated connected nodes can be retrieved in 
an amount of time equal to K. The worst-case (upper bound) performance in 
the implementation of an access primitive occurs when a table is searched using 
a foreign key, which may require a linear scan of the whole table. Such a deg- 
radation of performance is never experienced by our non-manifold winged-edge 
data structure (see Table 3). 
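
The contrast between the two bounds can be illustrated with a minimal sketch. It assumes, purely for illustration, that the edge table is held in a Python dictionary keyed by edge id; the table contents and both function names are hypothetical and are not taken from the VPF+ implementation.

```python
# Hypothetical edge table: primary key (edge id) -> (start node id, end node id).
EDGE_TABLE = {
    1: (10, 11),
    2: (11, 12),
    3: (12, 10),
}


def ap7_end_nodes(edge_id):
    """AP7: a single primary-key lookup, i.e. an access cost of K."""
    return EDGE_TABLE[edge_id]


def edges_at_node_by_scan(node_id):
    """Worst case: search on a foreign key (node id) by scanning every row."""
    return sorted(edge_id for edge_id, (start, end) in EDGE_TABLE.items()
                  if node_id in (start, end))
```

The first function touches one row regardless of table size, while the cost of the second grows linearly with the number of edges, which is the degradation the data structure is designed to avoid.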



The average time complexity of implementing each access primitive is given
in Table 2. Table 3 (column 2) shows more explicit expressions of the time
complexities.








R. Ladner, K. Shaw, and M. Abdelguerfi 



Table 2. Average Time Complexity

Access Primitive (AP)   Average Time Complexity
AP1                     |AP2| + E_f K
AP2                     F_e E_f K/2 + 3K
AP3                     |AP2| + E_f |AP9|
AP4                     |AP6| + E_n |AP9|
AP5                     |AP6| + K E_n
AP6                     K + α(|AP9| + F_e |AP2|) + α(F_e E_f - 1)K
AP7                     K
AP8                     K + 2|AP6|
AP9                     (F_e + 1)K



For the manifold case, α = 1 and F_e = 2 (assuming the existence of a universal
face). Additionally, in the manifold case, E_f and E_n have been shown in [4] to
be such that:

    E_f = E_n < 6(1 - 2(1 - G)/V).    (1)

V and G represent the number of connected nodes and the number of holes
respectively. The above expression implies that, in the manifold case, both E_f
and E_n are approximately 6 on the average.

Using the above values for α, F_e, E_f and E_n, average time complexities for the
manifold case are derived in Table 3 (column 3). These expressions show that, in
the manifold case, all nine access primitives can be performed in constant time,
on average.
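
As a cross-check of the manifold figures, the expressions of Table 2 can be evaluated numerically. The following sketch is illustrative and not part of the original analysis: it takes K as the unit of time and assumes α = 1, F_e = 2 and E_f = E_n = 6 (the approximate value implied by (1)). The function name and the dictionary layout are arbitrary choices.

```python
def average_costs(alpha, F_e, E_f, E_n):
    """Evaluate the Table 2 expressions; results are in units of K."""
    K = 1
    ap = {}
    ap[7] = K
    ap[9] = (F_e + 1) * K
    ap[2] = F_e * E_f * K / 2 + 3 * K
    ap[1] = ap[2] + E_f * K
    ap[3] = ap[2] + E_f * ap[9]
    ap[6] = K + alpha * (ap[9] + F_e * ap[2]) + alpha * (F_e * E_f - 1) * K
    ap[4] = ap[6] + E_n * ap[9]
    ap[5] = ap[6] + K * E_n
    ap[8] = K + 2 * ap[6]
    return ap


# Manifold case: one object per node, two faces per edge, about six edges
# around each face and each node.
manifold = average_costs(alpha=1, F_e=2, E_f=6, E_n=6)
for number in sorted(manifold):
    print(f"AP{number}: {manifold[number]:g}K")
```

With these inputs the nine values are 15K, 9K, 27K, 51K, 39K, 33K, K, 67K and 3K for AP1 to AP9, which agrees with the manifold column of Table 3.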



Table 3. Average Time Complexity: Manifold and Non-Manifold Cases

Access      Average Time Complexity                                  Average Time Complexity
Primitive   (Non-Manifold Case)                                      (Manifold Case)
AP1         (F_e E_f/2 + E_f + 3)K                                   15K
AP2         (F_e E_f/2 + 3)K                                         9K
AP3         (3F_e E_f/2 + E_f + 3)K                                  27K
AP4         K(1 + α(4F_e + F_e^2 E_f/2 + F_e E_f) + E_n(F_e + 1))    51K
AP5         K(1 + α(4F_e + F_e^2 E_f/2 + F_e E_f) + E_n)             39K
AP6         K + αK(F_e^2 E_f/2 + F_e E_f + 4F_e)                     33K
AP7         K                                                        K
AP8         3K + 2αK(F_e^2 E_f/2 + F_e E_f + 4F_e)                   67K
AP9         (F_e + 1)K                                               3K

4 The VPF+ Prototype

VPF+ is being prototyped in a Web-based virtual reality application for the 
Naval Research Laboratory at Stennis Space Center. In this application, a brow- 
ser is used to select a user-defined extent of terrain and known features existing 
within that extent. The VPF+ database is queried and a 3D virtual world is
generated using Virtual Reality Modeling Language (VRML). Functionality in- 
cludes the ability to walk or fly through the terrain, move around objects, enter 
buildings, display floor plans, etc. The application should be ideal for training 
and simulation exercises as well as actual field operations requiring on-site assi- 
stance in urban areas. 

The data flow in the creation of the VPF+ database is shown in Fig. 3.
The prototyped data consists of Military Operations in Urban Terrain (MOUT) 
Data for Camp LeJeune, North Carolina. The elevation data is DTED Level 
5, providing one meter elevation post spacing. The feature data consists of the 
coordinates (latitude and longitude) for the footprints of various buildings, the 
coordinates of the centerline of each of various roadways, and the coordinates of 
point features, all with descriptive attributes. 

An interim elevation model was obtained using Arcinfo to produce a Triangu- 
lated Irregular Network (TIN) of the elevation data, using the lines forming the 
footprints of the feature buildings as constraints. The original terrain elevation 
data contained over 90,000 elevation points for an area of only approximately 
600 meters square. TINning reduced the total elevation points to approximately 
400, greatly improving performance. Since this geographic area is known to be 
relatively flat, the remaining elevation points were considered adequate to ap- 
proximate the terrain. Using the buildings’ footprints as constraints guaranteed 
the existence of nodes conforming to the coordinates of each of the footprints 
and also guaranteed a uniformly flat terrain under each of the buildings. 

As a final preprocessing step, Arcinfo was used to convert the TIN into an 
Arcinfo Net file containing primitive data for all nodes, edges and faces in the 
terrain. Additionally, nodes, edges and faces were added for each building in the 
feature data, since only primitives for the buildings’ footprints existed. Roads, 
previously existing only as line features defined by centerlines, were widened to 
their appropriate width, and nodes, edges and faces were added. 

Library models were used to represent designated point features, such as 
lampposts and park benches. VPF+ data structures were then populated with
all primitives and VPF+ topology using software tools specifically written for
that purpose. Additional software tools were developed in Java to extract data
from VPF+ tables for rendering into a 3D synthetic environment with VRML.
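
The final rendering step can be pictured with a small sketch that writes node and face primitives as a VRML97 IndexedFaceSet. It is an illustration only: the prototype's tools were written in Java and their design is not described here, so the function name, the Python language choice and the in-memory layout of nodes and faces are all assumptions.

```python
def faces_to_vrml(nodes, faces):
    """Return a VRML97 scene holding one IndexedFaceSet.

    nodes: list of (x, y, z) coordinates (hypothetical layout).
    faces: list of tuples of indices into nodes, one tuple per face.
    """
    points = ",\n        ".join(f"{x:g} {y:g} {z:g}" for x, y, z in nodes)
    # In coordIndex, every face is a run of node indices terminated by -1.
    index = ",\n        ".join(
        ", ".join(str(i) for i in face) + ", -1" for face in faces)
    return (
        "#VRML V2.0 utf8\n"
        "Shape {\n"
        "  geometry IndexedFaceSet {\n"
        "    coord Coordinate {\n"
        "      point [\n        " + points + "\n      ]\n"
        "    }\n"
        "    coordIndex [\n        " + index + "\n    ]\n"
        "  }\n"
        "}\n"
    )


# One triangular terrain face built from three nodes (made-up coordinates).
scene = faces_to_vrml([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)])
```

Because faces refer to nodes by index, shared nodes are written once, which mirrors the way node, edge and face primitives reference each other in a topological data structure.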

Although development of the prototype is not yet complete, sample virtual 
worlds have been tested. Fig. 4 below shows one such world including the terrain 
area and some of the buildings, roads and point features in the data set, but 
without application of textures. Each figure shows the same world, but from a 
different viewpoint. Included is the successful generation of buildings’ interiors 
and the ability to “walk” into the building.









Fig. 3. Overview of Data Conversion and Rendering

Fig. 4. Virtual World Samples

Appendix A. Modified Rumbaugh Notation

A concrete class showing only its class name. Concrete classes are instantiated.

Shaded boxes indicate abstract classes that are not instantiated.

Inheritance (is-a) is shown by the triangle. Class A is either an object of
class B or an object of class C.

Every class A object is related to exactly 1 class B object.

Every class A object is related to zero or more class B objects. Every class B
object is related to zero or one class A object.

Every class A object is related to 2 class B objects. Every class B object is
related to 2 or more ordered class A objects (multiplicity labels: 2+ {ordered}
and 2).

Each (A,B) object pair is associated with one class C object.

The diamond shows an aggregation (has-a) relationship. The arrow indicates a
one way relationship. A is composed of one object of class B. A knows what
object of class B it is associated with. B does not know what object of class A
it is associated with.

Fig. 5. Appendix A. Modified Rumbaugh Notation

Abstract. Reliable digital terrain modelling necessarily needs to be based
upon a set of rules which first prescribe how to discretise the continuous
terrain surface, and second define the assumptions determining
the subsequent interpolation. Unresolved problems in automated DTM 
analysis are mostly due to a failure of today’s approach to fully appre- 
ciate the implications of these rules. The functional limitations imposed 
by monolithic GISs are identified as a major reason for this failure. The 
Pluggable Terrain Module (PTM) presented in this paper is proposed to 
overcome these drawbacks by using a modular approach. The basic idea 
is to move the processing of terrain information from GISs to the PTM; 
communication with other software components is specified by a set of 
interfaces, shifting terrain modelling towards distributed and interopera- 
ble geoprocessing. Based on the notions of the OpenGIS Geodata Model 
(OGM), the PTM design is proposed and its implications and benefits 
to terrain modelling are discussed. 



1 Motivation — Digital Terrain Modelling until Now 

Because the terrain surface is of great influence for most environmental processes 
and many human activities, digital terrain models (DTMs) provide an impor- 
tant basis for many types of analysis within GIS. Likewise, DTM analysis and 
characterisation is one of the major tasks of GIS applications. 

Despite the improvements in automated DTM analysis made over the past 
years, a number of problems remain unresolved. These stem from both
conceptual problems which result when considering DTMs as models of a
continuous surface, and drawbacks of today’s approach to modelling within GIS.

1.1 Conceptual Problems when Using DTMs as Basis for Spatial
Modelling

First, despite being models of continuous surface form, DTMs are mostly created 
from discrete elevation values. This causes conceptual difficulties because proper 
and logically consistent DTM analysis requires continuous surface
representation.¹

A second conceptual problem is the question of what each sample data element
represents within the model. This problem focuses on semantic uncertainties 
arising from questions such as: 

— What is the area represented by each sample data element, e. g. does a given 
gridded elevation matrix represent a point grid or a cell grid, or are there 
any forms of implicit interpolation? 

— Are any additional assumptions made about the spatial relationships bet- 
ween recorded sample values, e.g. are irregularly distributed points related 
to each other only based on proximity? 

— Which is the information content of each sample data element, e. g. does it 
carry only height information or also further information such as representing 
a peak, a pass, or part of a slope break line? 
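
The first of these questions can be made concrete with a small sketch showing that the same elevation matrix yields different elevations at the same location, depending on whether it is read as a cell grid or as a point grid. The 2 x 2 matrix, the unit spacing and both function names are assumptions made for illustration; bilinear interpolation is only one possible reading of a point grid.

```python
import math

# Made-up elevations; GRID[row][col], with unit spacing in x (col) and y (row).
GRID = [[10.0, 20.0],
        [30.0, 40.0]]


def elevation_cell_grid(grid, x, y):
    """Cell-grid reading: each value is constant over its unit cell."""
    # Cell (row, col) covers [col, col + 1) x [row, row + 1).
    return grid[int(math.floor(y))][int(math.floor(x))]


def elevation_point_grid(grid, x, y):
    """Point-grid reading: values sit at integer nodes, bilinear in between."""
    c0, r0 = int(math.floor(x)), int(math.floor(y))
    c1 = min(c0 + 1, len(grid[0]) - 1)
    r1 = min(r0 + 1, len(grid) - 1)
    tx, ty = x - c0, y - r0
    top = (1 - tx) * grid[r0][c0] + tx * grid[r0][c1]
    bottom = (1 - tx) * grid[r1][c0] + tx * grid[r1][c1]
    return (1 - ty) * top + ty * bottom


as_cells = elevation_cell_grid(GRID, 0.5, 0.5)
as_points = elevation_point_grid(GRID, 0.5, 0.5)
```

At location (0.5, 0.5) the cell-grid reading returns 10.0 and the point-grid reading returns 25.0, although both are derived from identical sample values. The two readings also cover different spatial extents, which is a further source of ambiguity when the interpretation is not stated.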

Third, DTMs implicitly model at a certain scale. Hence, the information 
derived is relevant to the scale implied by the model. Since this scale is often 
arbitrary and not necessarily related to the scale of analysis, derived results may 
not always be appropriate. [6] 

Finally, although the importance of quality information to spatial modelling 
and decision making is theoretically well known, most DTM applications neglect 
the management of error and uncertainties inherent in the model. 



1.2 Drawbacks of Today’s Approach to Modelling within GIS 

Currently, most of the available GISs provide certain data models and functio- 
nality to support the implementation of a spatial modelling task. Since GISs can 
not be infinite in complexity, they offer a limited selection of functionality that 
is considered to be a flexible and comprehensive tool set. Thus, the translation 
of a spatial model into a respective digital representation is limited by the infor- 
mation system’s capabilities. It is not surprising that modelling mostly occurs 
with the help of few but common functions. Consequently, the phenomena to 
be modelled have a much lower influence on the choice of the functions than 
the available GIS and its limitations. Since the limitations of monolithic GISs 
hinder fully considering the conceptual problems mentioned above, appropriate 
environmental modelling with available GISs is a difficult task. 

¹ Such difficulties may be interpreted as communication problems between different
information communities (data producer and data user communities) who describe
terrain - or, to be more general, geographic information - in different ways. [1]

1.3 Modular Approach for Digital Terrain Representation and
Processing

Since monolithic GISs can not provide all the functionality needed to perform 
sound applications of DTMs, it is logical to process the digital terrain represen- 
tation outside the GIS. This calls for a modular approach. For such an approach, 
the so-called Pluggable Computing Model as proposed by the OpenGIS Guide
[1] forms an excellent conceptual basis. In this concept, Tool Services, implemen- 
ted as so-called Pluggable Tools, provide sets of functionalities to fulfil specific 
tasks. Well-defined and commonly known and accepted interfaces allow exchan- 
ging of messages and data objects with other Tool Services or other software 
components. 

In this paper, a modular approach to process terrain information is proposed 
which overcomes the conceptual problems mentioned above. Being based on the 
concept of Pluggable Tools, our approach follows the notions of distributed and 
interoperable geoprocessing. First, an essential model for digital terrain models 
is proposed, followed by a discussion of its implications in the context of environ- 
mental modelling. After these preliminary considerations, the Pluggable Terrain 
Module (PTM) is presented. It is shown how its concept derives from the Virtual 
Data Set (VDS, [4]). The internal design of the module is explained and its
benefits are highlighted.

2 Essential Model 

The purpose of a DTM is to provide a sound digital description of a portion of the 
earth’s surface. Based on this understanding of digital terrain representations, 
we suggest that the essential model² of a DTM consist of the triple (D, {rules},
V) [5] where

— D denotes the spatial extent of the area of interest (i.e. the portion), 

— {rules} is a set of rules which define the information content of the sam- 
ple data set and the assumptions made about the true surface (i.e. sound 
description), 

— V is the range of the terrain representation.

D is a usually rectangular area of the earth’s - or another planet’s - surface, 
given in some geographic reference system. Except for the task of joining two or 
more different terrain models, D does not cause conceptual problems. 

The same holds true for the value type of V, because as long as the modelling 
of errors and uncertainties is neglected elevation can be represented as a simple 
scalar. However, the value type of V becomes much more complex (e.g. intervals,
tuples, or probability functions) when quality issues are introduced. Since this
paper focuses on the integration of quality information in terrain representation
and analysis, and not on the assessment and representation of quality itself, a
broad discussion of V is beyond the scope of this paper.

² According to the object-oriented modelling approach proposed by [2], the term
’essential model’ denotes model design at the highest level of abstraction; the
essential model provides a description of some real or imaginary situation by
purposefully considering entities and phenomena situated in a context, rather
than by trying to describe all of some supposedly objective reality.

The set of rules determines how the data that support the terrain represen- 
tation have to be sampled and arranged and how elevation values and other 
topographic information is to be derived, i.e. how the actual terrain surface re- 
presentation is to be built upon the data. These tasks are often considered to be 
mere questions of surveying techniques and instruments, sampling density, and 
algorithmic and computational efficiency. The conceptual problems mentioned 
above, though being inherent to the rules, are less frequently discussed and very 
rarely taken into account in DTM applications. Especially the effects from the 
absence of appropriate assumptions are barely perceived. Therefore, the rules 
need further discussion. 
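
To fix ideas, the triple (D, {rules}, V) can be written down as a small data structure. This is a minimal sketch under our own assumptions: the class names, the field names and the representation of a rule as a labelled statement are hypothetical and are not prescribed by [5].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    """D: a rectangular area of interest in some geographic reference system."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    reference_system: str

    def contains(self, x, y):
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


@dataclass
class Rule:
    """One explicit rule: how to discretise, or what to assume when interpolating."""
    kind: str        # "discretisation" or "interpolation"
    statement: str


@dataclass
class EssentialModel:
    extent: Extent             # D
    rules: List[Rule]          # {rules}
    value_type: type = float   # V: a simple scalar while quality is neglected


model = EssentialModel(
    extent=Extent(0.0, 0.0, 600.0, 600.0, "local grid (hypothetical)"),
    rules=[
        Rule("discretisation", "break lines are sampled explicitly"),
        Rule("interpolation", "surface is G1-continuous except at break lines"),
    ],
)
```

Keeping the rules as first-class members of the model, rather than leaving them implicit in the software that produced the data, is the point of the essential model: a consumer can inspect them before choosing an analysis method.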

As terrain is a continuous surface form, its appropriate representation by 
digital means necessarily involves two steps: 

— Discretisation, when sampling the data, 

— simulation of continuity by interpolation. 

In the context of digital terrain modelling, each of these two steps raises a 
question that seems simple, but that is of substantial importance: 

— How should the terrain surface be discretized? 

This question asks about the information content of the sample data set. 
Should the sample consist only of elevation measurements, for example, or 
should it include further terrain-specific information, such as rivers, breaks 
in slope lines, etc.? And second, what should its modelling scale be? 

The discretisation has serious implications for the features represented by the 
DTM and their levels of detail, because information which is not explicitly 
indicated by the sample data can not be modelled (nor extracted) reliably 
(Fig. 1). 

Though measurement is a crucial source of uncertainty, a review of the actual 
data collection (in the literal sense of measurement of the sample data) is 
beyond the scope of this paper (for a detailed discussion of this task see [5]). 

— How should the surface be interpolated? 

As the real terrain surface is not a mathematical surface, we are forced to 
make assumptions about the true surface. Reasonable assumptions could be 
that the surface must be G1-continuous³ except where break lines occur or
that the surface must not contain sinks where not explicitly indicated. 

A direct consequence of interpolating based on assumptions is that the re- 
sulting DTM bears distinct properties. In the above example, one property 
of the resulting surfaces would be that a derivative exists for every point on 
the surface (except at break lines), or, in the latter case, that the resulting
model is hydrologically sound. 

³ ’Geometric continuity’: for every point of the surface one and only one tangent plane
exists.

Fig. 1. Reliable modelling of a topographic feature. The rigid structure of the regular
point raster is not able to model the stream as reliably as irregularly distributed
and triangulated points. However, the triangulated irregular network (TIN) does not 
guarantee an appropriate modelling; smart placement of the data points as well as a 
triangulation that follows geomorphologic constraints are essential. 



As assumptions are, by their nature, uncertain to some degree, further un- 
certainty is introduced in this step. This uncertainty is (mostly) independent 
from the sample data and may be denoted as ’semantic uncertainty’. 



3 Implications on Modelling Terrain Surfaces 

3.1 Information Content of Sample Data Set 

The base data (consisting of geometric, semantic, and quality information) im- 
plies the information content as well as the scale of the model (where ’scale’ 
refers to level of detail): 

— The information content has a direct influence on the reliability of the mo- 
del. First, the more information the modelling is based upon, the less vague 
the result of the interpolation step, namely the shape of the digital terrain 
surface, will be (where ’more information’ does not refer to an increased number
of samples, but to more meaningful data samples). Second, as mentioned
above, reliable feature extraction means that there must be some evidence in 
the base data for the results. Only topographic information that is contained 
by the base data can be made derivable with the help of interpolation and 
extraction methods. 

— Extraction of scale dependent topographic information requires that the 
’scale of analysis’ agrees with the ’scale implied by the model’. If for instance 
a spatial model calls for sinks, the term ’sink’ will always be (explicitly or 
implicitly) defined at a certain scale, i.e. a sink must have a certain mini- 
mal size to be considered. Consequently, the terrain model must contain all 
sinks of that minimal size, and the extraction methods must be aware that 
smaller sinks - if represented by the terrain model - are to be ignored, i.e.
generalised.⁴



3.2 Interpolation Scheme Properties of the Terrain Representation 

The digital terrain surface generated by the interpolation has distinct properties. 
Digital terrain models are abstractions of the true terrain - not the true terrain 
itself. Therefore, to avoid semantic ambiguities, the interpolation schemes to- 
gether with the underlying assumptions must be explicitly stated. 

It is argued that such semantic ambiguities may lead to choosing inappro- 
priate terrain analysis methods. Take hydrologic modelling as an example: Cur- 
rent algorithms for automated drainage network extraction implicitly assume 
an idealised water discharge of zero volume. This assumption will work well if 
the DTM used is hydrologically sound. If this is not the case, for instance if 
small sinks occur, the idealised approach is inappropriate. The extraction will 
yield interrupted and disconnected watercourses unless the approach is adjusted, 
that is by making different assumptions about the rate of stream flow. Hence, 
an explicitly stated interpolation scheme allows the comparison of the spatial 
modelling approaches with the properties of the DTM to ensure that the terrain 
representation supports the selected approaches. 
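
A simple check of this kind can be sketched for a gridded DTM. The code below is a deliberately reduced illustration: it treats a terrain model as hydrologically sound if it contains no single-cell sinks, ignores larger depressions and the grid border, and uses hypothetical function names and made-up elevations.

```python
def find_sinks(grid):
    """Return (row, col) of interior cells strictly lower than all 8 neighbours."""
    sinks = []
    for r in range(1, len(grid) - 1):
        for c in range(1, len(grid[0]) - 1):
            neighbours = [grid[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            if grid[r][c] < min(neighbours):
                sinks.append((r, c))
    return sinks


def is_hydrologically_sound(grid):
    """Simplified test: sound means that no single-cell sink was found."""
    return not find_sinks(grid)


# A small depression at row 1, column 1.
DEM_WITH_SINK = [[5, 5, 5, 5],
                 [5, 3, 4, 5],
                 [5, 4, 4, 5],
                 [5, 5, 5, 5]]

# A plane sloping steadily towards the last row.
DEM_RAMP = [[3, 3, 3],
            [2, 2, 2],
            [1, 1, 1]]
```

An application that assumes an idealised discharge could run such a test first and switch to a different stream flow assumption when sinks are reported, instead of silently producing disconnected watercourses.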



3.3 Quality (in the Sense of ’Fitness for Use’) 

Quality does not only express metric accuracy. It determines the usability of the 
abstraction underlying a digital terrain model by specifying for instance: 

— completeness: Does the model represent all required topographic informa- 
tion? 

— currency: Are base data and model current (though generally constant, ter- 
rain may change quickly in specific environments, for instance glaciers and 
dunes)? 

— attribute accuracy: Do the base data represent what they are supposed to? 

When terrain models are applied in spatial models, a major focus of quality 
reports is to express uncertainty. Uncertainty is inherent to all the data
(persistent and virtual). Additionally, uncertainty is introduced through inappropriate
usage of data, methods, and models. To minimise uncertainties introduced wit- 
hin the scope of a spatial modelling project, the following requirements must be 
fulfilled: 

— The terrain model has to be suitable for the intended modelling (i.e. all the 
required topographic information has to be appropriately represented by the 
DTM). 

— Information about properties and quality of the terrain model has to be
available independently from the actual topographic data.

— The user of the DTM has to understand the terrain model, i.e. to know the
abstraction as well as the related properties of the model.

⁴ From the latter point, it follows that a denser sampling of base data does not
necessarily make the results of the analysis more reliable.

4 Pluggable Terrain Module (PTM) 

The above discussion leads to the following list of requirements for terrain
modelling:

— The real terrain must be appropriately represented, i.e. as a continuous sur- 
face, 

— quality information must be available, 

— metadata to avoid semantic ambiguities must be available, 

— appropriate modelling tools to derive topographic information must be avai- 
lable. 

As already stated, these requirements can not be fully fulfilled when model- 
ling terrain within a monolithic GIS. We therefore propose to perform terrain 
modelling and analysis within a module outside the GIS. This module holds 
the actual terrain data as well as comprehensive information about quality and 
semantics; it simulates the continuous surface with the help of sound modelling 
functionality, and makes topographic and meta-information available through a 
well-defined interface. In this way, the DTM becomes an autonomous module 
that is able to communicate with GISs. 

The proposed module simulates, on the one hand, a complete continuous 
surface with the help of base data and appropriate modelling methods, both 
hidden inside the module. On the other hand, it makes all derivable information 
available via an interface. Hence, it is in accordance with the notion of the Virtual 
Data Set⁵ (VDS) as presented by [4]. The design of the module, described in
detail in the following sections, builds upon the OpenGIS Data Model (OGM,
[1]). Its function in the context of interoperable geodata processing coincides
with the Pluggable Tool in the Pluggable Computing Model environment [1] -
thus the name Pluggable Terrain Module (PTM). 

⁵ The basic idea is to extend sample data with methods to provide any derivable or
predictable information. Instead of transforming original data to a standard format
and storing them, the original data are enhanced with persistent (i.e. explicit)
methods that will only be executed upon request. [4] In other words, the data exchange
is not specified by a standardised data structure (e.g. a physical file format) but
by a set of interfaces. These interfaces provide data access methods to retrieve the
actual data values contained in a data set or a query result. An application that uses
a VDS therefore does not ’read’ the data from a physical file (or query a database), but
calls a set of corresponding methods defined in the VDS which return the data
requested. [5] As its name indicates, a virtual data set contains virtual data. Virtual
data is information which is not physically present, that is, which is not persistent.
This data is computed upon request at run time. [4]

4.1 Internal Design

The PTM consists of 3 components (see Fig. 2): 

— the Property Set, 

— the Specification Schema, 

— and the Identification Schema. 



Property Set. The terrain module is basically a collection of identified
properties. Geospatial properties consist of geometry and semantics, the latter
expressed through attributes. To each geospatial property one or more quality
properties are attached.

— Geospatial Properties: 

Geospatial properties represent the (geospatial) phenomena and entities modelled
by the module. Geospatial properties may be persistently stored data - in which
case we speak of supporting geospatial properties - or persistently stored
methods⁶, i.e. virtual data in the sense that it is derived upon request -
therefore virtual geospatial properties.

— Quality Properties: 

Quality property tuples describe quality aspects of the terrain module. They 
may be related to single geospatial properties, to groups of geospatial pro- 
perties, or to the whole terrain model. Quality properties may be persistently 
stored (if related to persistent data), or virtual, i.e. persistently stored me- 
thods for assessing the respective quality property (if related to virtual data). 
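
The distinction between supporting and virtual properties can be sketched as follows. The sketch is a simplified illustration under our own assumptions: the class and method names, the property names and the quality values are hypothetical, and quality properties are reduced to a plain dictionary per geospatial property.

```python
class PropertySet:
    """Holds supporting (stored) and virtual (computed on request) properties."""

    def __init__(self):
        self._stored = {}    # property name -> persistently stored value
        self._methods = {}   # property name -> persistently stored method
        self._quality = {}   # property name -> attached quality properties

    def add_supporting(self, name, value, quality=None):
        self._stored[name] = value
        self._quality[name] = dict(quality or {})

    def add_virtual(self, name, method, quality=None):
        self._methods[name] = method
        self._quality[name] = dict(quality or {})

    def is_virtual(self, name):
        return name in self._methods

    def get(self, name, *args):
        """Return a stored value, or derive a virtual one upon request."""
        if name in self._stored:
            return self._stored[name]
        return self._methods[name](self, *args)

    def quality(self, name):
        return self._quality[name]


def mean_height(props):
    """Stored method deriving a virtual property from the supporting data."""
    heights = [z for _, _, z in props.get("spot_heights")]
    return sum(heights) / len(heights)


ps = PropertySet()
ps.add_supporting("spot_heights",
                  [(0.0, 0.0, 100.0), (10.0, 0.0, 110.0)],
                  quality={"vertical_rmse_m": 0.5})   # made-up quality value
ps.add_virtual("mean_height", mean_height,
               quality={"derived_from": "spot_heights"})
```

A caller uses `ps.get(...)` in the same way for both kinds of property; whether the value was stored or computed is visible only through `is_virtual`.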



Specification Schema. The specification schema exposes the exact specifica- 
tions of all data elements and methods (i.e. of all properties) constituting the 
module. The schema itself has three components: 

— A Geometry Schema which exactly specifies all the geometric structures used,
as well as the spatial reference system(s) referred to. 

— An Attribute Schema which exactly specifies all properties. 

Each geospatial property is specified with its exact property name - in the 
case of virtual data together with the name of the corresponding method -, 
its value type - which in case of virtual data is at the same time the return 
value type of the respective method -, and its quality properties. 

The quality properties are specified with the exact property names and value 
types. Specification must also be provided for the method to be used for 
quality assessment and the reference system to be referred to. 

— A Method Schema which contains specifications of all the methods used, 
including all the methods involved in the generation of virtual data and 
methods for quality assessment. 

⁶ These persistent methods basically correspond to what is denoted by ’stored
functions’ in [1].

Fig. 2. Internal design of the Pluggable Terrain Module (PTM). ‘PN/VT-pairs’ are
‘Property Name/Value Type-Pairs’ as described by [1]. The dashed-dotted lines indicate
the components considered to form the metadata. Otherwise, we use conventions for
diagrams proposed in [2].

Identification Schema. The Identification Schema is basically an explanation
of the module. It serves to avoid misunderstandings and to help understand the
intended meaning of the terrain representation, i.e. what topographic information
the module contains and what it does not. The identification schema
itself consists of four parts:

— Lexical Identification (’Lexical Semantics’, [1]) 

Contains the dictionary of terms used in the specification schema. The lexical 
identification must provide sufficiently rich definitions and examples so that 
there will be no ambiguity concerning the meaning of the terms used. Put 
in more concrete terms, the lexical identification must explain all property 
names (of both geospatial and quality properties) and all value types. 

— Use Identification (’Use Semantics’, [1]) 

Provides details on how to exploit the PTM. Included are the uses intended
for the information when the terrain model was generated (if any) as well as
options and limitations for obtaining the module.

— Module Identification (’Project Semantics’, [1]) 

Describes how the DTM was conceptualised and how it was generated. It 
must contain information concerning: 

— Where: What is the region of the earth covered (i.e. physical extent)? 

— Who was responsible for data collection and DTM generation, who dis- 
tributes the data. 

— What phenomena are modelled in the terrain module, and what is their 
approximate scale. The underlying assumptions and conceptualisations 
must be included. This item explains the geospatial properties of the 
terrain model. 

— How were the data acquired, and how was the terrain model generated? 
— History: When did data collection and terrain model generation take 
place? What transformations occurred, and when? 

— Quality Identification 

Detailed description of the quality information available about the terrain 
model. This serves to catalogue the quality information available and helps to 
avoid misleading interpretations. Included must be the following information: 
— Who was responsible for the quality assessment. 

— When was quality investigated? 

— What are the quality property names and value types? 

— How was quality assessed? 

— What is the reference system referred to?



4.2 Continuous Terrain Surface Representation 

The actual terrain representation is realised within the property set based upon 
the geospatial properties. One way to model a continuous surface is, for in- 
stance, to cluster or arrange the supporting geospatial properties (that is, the 
persistently stored sample data) into a pattern of non-overlapping geometries. 
This step is usually denoted as ’tesselation’. For each tile of the tesselation,
elevation is then modelled by means of an explicit and persistent interpolation
scheme. While the sample data define elevations only for the supporting geospa- 
tial properties, the domain of the model is extended to the entire tesselation tiles 
by interpolation. 

Being part of the necessarily explicit interpolation scheme, the tesselation 
is likely to be stored persistently. It is, however, also possible to compute the
tesselation only upon request, so that the tesselation shows virtual behaviour.
The decision whether the tesselation shall be persistent or not is left to
the implementation (and, in the end, is likely to be a matter of efficiency).

Interpolation does not necessarily need to be preceded by tessellation, as the
above example may seem to suggest. The point, however, is that the proposed
approach allows the terrain surface to be specified directly from the sample
data, thereby avoiding lossy transformations of the original data.

4.3 PTM Query 

Being based on the notion of Virtual Data Sets [4], the main idea underlying the
approach taken here is that the query and exchange of derivable terrain data
is specified not by standardised data formats but by a set of interfaces. These
interfaces provide the methods necessary to retrieve the actual information
represented by the terrain model. For this purpose, a PTM needs to compute its
geospatial property values (if the requested information is not persistent) and
to expose them at locations specified by applications. This requires the
specification of persistent methods which, upon request, compute the virtual
geospatial properties. The proposed PTM design fulfils this requirement by means
of the method schema, which is part of the specification schema. (Consequently,
the method schema contains the specifications of the query interface.)

Therefore, PTMs lend themselves to implementation within distributed computing
environments as objects or services which can be queried by applications
requesting terrain information.
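
A query interface of this kind might look as follows. This is a sketch under stated assumptions, not the interface of the proposed design: the names TerrainQuery, PlanarPTM and profile are hypothetical, and the toy terrain is a plane so that the derived property (slope) has an obvious value.

```python
import math
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class TerrainQuery(ABC):
    """Hypothetical query interface: applications see methods, not formats."""

    @abstractmethod
    def elevation(self, x: float, y: float) -> float:
        """Elevation at an application-specified location."""

    @abstractmethod
    def slope(self, x: float, y: float) -> float:
        """A derived (virtual) property, computed upon request."""


class PlanarPTM(TerrainQuery):
    """Toy model whose surface is the plane z = z0 + gx * x + gy * y."""

    def __init__(self, z0: float, gx: float, gy: float) -> None:
        self._z0, self._gx, self._gy = z0, gx, gy

    def elevation(self, x: float, y: float) -> float:
        return self._z0 + self._gx * x + self._gy * y

    def slope(self, x: float, y: float) -> float:
        # Gradient magnitude; constant everywhere on a plane.
        return math.hypot(self._gx, self._gy)


def profile(ptm: TerrainQuery,
            points: Iterable[Tuple[float, float]]) -> List[float]:
    """Application code: depends only on the interface, not on the model."""
    return [ptm.elevation(x, y) for x, y in points]
```

Because profile() is written against TerrainQuery alone, any other model implementing the same methods could be substituted, locally or behind a remote service, without changing the application.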

5 Conclusions 

5.1 Benefits of the Proposed Approach 

— The PTM presents the abstraction of the true terrain as a continuous surface. 

— With the terrain surface being specified directly from the sample data,
transformations of the original data set are avoided, so that no information is
lost and no uncertainty is introduced by (unnecessary) transformations.

— The assumptions the terrain representation is implicitly based upon are made
explicit by means of the persistent methods. Likewise, metadata is provided
to detail the structures and terms used. Together, this information contributes
substantially to preventing ambiguities in data interpretation.

— The major uses of metadata are to support the assessment of fitness for use
and to provide the information needed to acquire, process and properly
interpret data. The proposed PTM design allows the metadata to be available
separately. However, this requires that the persistent methods are designed to
provide quality information for each derived geospatial property.

— The PTM basically includes the topics of metadata identified in the ‘Content
Standard for Digital Geospatial Metadata’ [3].

— Unlike DTMs in the conventional sense, a PTM comprises more than the
terrain surface description itself. Based on persistent methods, it also
represents derivable terrain properties. By embedding the functionality needed
for terrain analysis, a PTM provides an application with proper tools^ and
thereby prevents it from choosing inappropriate approaches to handling the
terrain information.

— Being themselves geospatial properties (virtual geospatial properties), the
methods provided for terrain analysis, or strictly speaking their results, are
necessarily accompanied by quality properties. Therefore, at least at the
design level, no terrain information should be derivable without corresponding
quality descriptions.

— A PTM may provide several different terrain representations for a single 
sample data set. That is, different abstractions of the true terrain surface 
can be realised simultaneously (multiple terrain representations).

— The notion underlying a PTM implies a shift of responsibility from data
users to data producers [5]. It is argued that the expert knowledge forming
a prerequisite to terrain representation and analysis is available more in the 
data producer domain than in the data user communities. It therefore seems 
obvious to leave the specification of such tasks to terrain modelling experts
(who, following the above argument, are likely to be part of the data
producer domain). Data users, on the other hand, get the terrain information
they need by PTM query through well-defined interfaces. Thus, they can 
concentrate on their application and need not be aware of low-level details 
of the actual implementation. However, they can query a PTM for such 
information. 

— Of course, it makes no sense to have data producers write the methods con- 
stituting a PTM from scratch for every implementation. Instead, predefined 
methods, e.g. from a library, could be used and specialised.® In that sense, a 
PTM offers an efficient implementation approach based on reusable software 
components. 
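
The requirement from the list above that derived terrain information is always accompanied by a quality description can be sketched as a return type that bundles both. The example assumes that quality is expressed as a standard deviation and that the errors of the two inputs are independent; neither assumption comes from the proposed design, and the names Measured and elevation_difference are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A property value together with its quality (assumed: std. deviation)."""
    value: float
    std_dev: float


def elevation_difference(a: Measured, b: Measured) -> Measured:
    """Derived property b - a; its quality is propagated, never dropped.

    Assumes independent errors, so the variances of the inputs add.
    """
    return Measured(
        value=b.value - a.value,
        std_dev=math.hypot(a.std_dev, b.std_dev),
    )
```

Since the function can only return a Measured, a caller cannot obtain the derived value without its quality, which mirrors the design-level guarantee described above.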

5.2 Drawbacks of the Approach 

— As quality properties are added to the geospatial properties, the data volume 
tends to be multiplied. 

— Computational and communication costs are substantial and may heavily 
impact runtime efficiency. 

^ That is, the actual analysis happens inside the PTM. The derived results can be
queried by the GIS through well-defined interfaces.

® For persistent methods to be shared, they need to be encapsulated (or, in terms 
of OpenGIS, to become well known, and their parameters to become Well-Known 
Structures (WKS, [1])). 

6 Outlook

The modular approach, in which information is computed on request from inside
the module, permits the use of symbolic computation. Symbolic computation would,
for example, allow the numeric error introduced during computation to be
reduced. However, as pointed out by [5], symbolic computation is still available
only within specialised software packages such as Maple or Mathematica.

With regard to full interoperability, technical questions need further
discussion. Applying the module presented here in an interoperable environment
makes sense only if the PTM can be transferred to the client and executed in the
client’s context [5].

Future research will concentrate on the implementation of a prototype PTM.
To this end, many unresolved problems, mainly concerning the representation and
analysis of continuous terrain surfaces as well as the assessment and management 
of quality information, must be addressed. Research efforts in these directions, 
especially in the field of quality management, are currently underway at the 
Geographic Institute at the University of Zurich. 